BUG: Extracted JPEG data seems to end prematurely #2266

Closed
michelcrypt4d4mus opened this issue Oct 24, 2023 · 4 comments · Fixed by #2595
Labels
is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF is-regression Regression introduced as a side-effect of another change workflow-images From a users perspective, image handling is the affected feature/workflow

Comments

@michelcrypt4d4mus

I'm not 100% sure this is a PyPDF issue, though I suspect it is a regression introduced since 3.14.0: this never used to happen in my application, and now it happens quite frequently even though both the calling code and the PyTesseract package are unchanged. There's at least a small chance the issue lies in the underlying Tesseract binary.

Environment

$ python -m platform
3.11.5

$ python -c "import pypdf;print(pypdf._debug_versions)"
3.16.4

Code + PDF

The code is here; in particular, these lines extract a PIL.Image object from the PDF:

for image_number, image in enumerate(page.images, start=1):
    image_obj = Image.open(io.BytesIO(image.data))

The resulting PIL.Image object is then passed to PyTesseract here:

text = pytesseract.image_to_string(image)

PyTesseract then fails with this:

TesseractError: (1, 'Corrupt JPEG data: premature end of data segment Error in pixReadStreamJpeg: read error at scanline 2206; nwarn = 1 Error in pixReadStreamJpeg: bad data Error in pixReadStream: jpeg: no pix returned Error in pixRead: pix not read Error during processing.')
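Since the error points at truncated JPEG data, a quick sanity check on the extracted bytes may help separate a pypdf problem from a Tesseract one. The following is only a minimal sketch: `jpeg_marker_check` is a hypothetical helper (not part of pypdf or PyTesseract) that looks for the standard JPEG start-of-image (`FF D8`) and end-of-image (`FF D9`) markers. A missing end marker hints at truncation, but a present one does not prove the scan data is complete.

```python
def jpeg_marker_check(data: bytes, tail: int = 16) -> dict:
    """Report whether the bytes carry the JPEG SOI and EOI markers.

    Hypothetical helper for illustration. `tail` is an assumed window
    size that tolerates a few padding bytes after the EOI marker.
    """
    return {
        # A JPEG stream starts with the SOI marker FF D8.
        "has_soi": data[:2] == b"\xff\xd8",
        # A complete JPEG stream normally ends with the EOI marker FF D9.
        "has_eoi_near_end": b"\xff\xd9" in data[-tail:],
    }
```

In the loop above, this could be called as `jpeg_marker_check(image.data)` before the bytes are handed to PIL.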

PDF file

You can use the PDF file in tests.
FTX Claim SC30 01072023101624File595287144.pdf

Traceback

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /Users/uzor/workspace/clown_sort/clown_sort/files/image_file.py:123 in extract_text     │
│                                                                                                  │
│   120 │   │   text = None                                                                        │
│   121 │   │                                                                                      │
│   122 │   │   try:                                                                               │
│ ❱ 123 │   │   │   text = pytesseract.image_to_string(image)                                      │
│   124 │   │   except pytesseract.pytesseract.TesseractError as e:                                │
│   125 │   │   │   console.print_exception()                                                      │
│   126 │   │   │   console.print(warning_text(f"Tesseract OCR failure '{image_name}'! No OCR te   │
│                                                                                                  │
│ /Users/uzor/Library/Caches/pypoetry/virtualenvs/clown-sort-BrYcfkKs-py3.11/lib/python3. │
│ 11/site-packages/pytesseract/pytesseract.py:423 in image_to_string                               │
│                                                                                                  │
│   420 │   """                                                                                    │
│   421 │   args = [image, 'txt', lang, config, nice, timeout]                                     │
│   422 │                                                                                          │
│ ❱ 423 │   return {                                                                               │
│   424 │   │   Output.BYTES: lambda: run_and_get_output(*(args + [True])),                        │
│   425 │   │   Output.DICT: lambda: {'text': run_and_get_output(*args)},                          │
│   426 │   │   Output.STRING: lambda: run_and_get_output(*args),                                  │
│                                                                                                  │
│ /Users/uzor/Library/Caches/pypoetry/virtualenvs/clown-sort-BrYcfkKs-py3.11/lib/python3. │
│ 11/site-packages/pytesseract/pytesseract.py:426 in <lambda>                                      │
│                                                                                                  │
│   423 │   return {                                                                               │
│   424 │   │   Output.BYTES: lambda: run_and_get_output(*(args + [True])),                        │
│   425 │   │   Output.DICT: lambda: {'text': run_and_get_output(*args)},                          │
│ ❱ 426 │   │   Output.STRING: lambda: run_and_get_output(*args),                                  │
│   427 │   }[output_type]()                                                                       │
│   428                                                                                            │
│   429                                                                                            │
│                                                                                                  │
│ /Users/uzor/Library/Caches/pypoetry/virtualenvs/clown-sort-BrYcfkKs-py3.11/lib/python3. │
│ 11/site-packages/pytesseract/pytesseract.py:288 in run_and_get_output                            │
│                                                                                                  │
│   285 │   │   │   'timeout': timeout,                                                            │
│   286 │   │   }                                                                                  │
│   287 │   │                                                                                      │
│ ❱ 288 │   │   run_tesseract(**kwargs)                                                            │
│   289 │   │   filename = f"{kwargs['output_filename_base']}{extsep}{extension}"                  │
│   290 │   │   with open(filename, 'rb') as output_file:                                          │
│   291 │   │   │   if return_bytes:                                                               │
│                                                                                                  │
│ /Users/uzor/Library/Caches/pypoetry/virtualenvs/clown-sort-BrYcfkKs-py3.11/lib/python3. │
│ 11/site-packages/pytesseract/pytesseract.py:264 in run_tesseract                                 │
│                                                                                                  │
│   261 │                                                                                          │
│   262 │   with timeout_manager(proc, timeout) as error_string:                                   │
│   263 │   │   if proc.returncode:                                                                │
│ ❱ 264 │   │   │   raise TesseractError(proc.returncode, get_errors(error_string))                │
│   265                                                                                            │
│   266                                                                                            │
│   267 def run_and_get_output(                                                                    │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TesseractError: (1, 'Corrupt JPEG data: premature end of data segment Error in pixReadStreamJpeg: read error at scanline 2206; nwarn = 1 Error in pixReadStreamJpeg: bad data Error in pixReadStream: jpeg: no pix returned Error in pixRead: pix not read Error during processing.')

@michelcrypt4d4mus
Author

When I downgrade to 3.14.0 this issue goes away so I think it can be confirmed as a regression. Here's a file that was failing in 3.16.4 but working fine in 3.14.0 (also usable for tests):
FTX Claim Skybridge Capital 30062023113350File971325116.pdf

@MartinThoma MartinThoma added workflow-images From a users perspective, image handling is the affected feature/workflow is-regression Regression introduced as a side-effect of another change labels Oct 28, 2023
@MartinThoma MartinThoma changed the title Extracted JPEG data seems to end prematurely BUG: Extracted JPEG data seems to end prematurely Oct 28, 2023
@MartinThoma MartinThoma added the is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF label Oct 28, 2023
@pubpub-zz
Collaborator

images from FTX Claim SC30 01072023101624File595287144.pdf :
iss2266a_images.zip

@pubpub-zz
Collaborator

images from FTX.Claim.Skybridge.Capital.30062023113350File971325116.pdf
iss2266b_images.zip

@pubpub-zz
Collaborator

@michelcrypt4d4mus
Can you please indicate exactly which images used to fail?
Checking every image by hand is too time-consuming.
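One way to narrow this down could be to force a full decode of every extracted image and list only the ones that raise. This is a rough sketch under stated assumptions: `find_failing`, `pdf_images`, and `pillow_full_decode` are hypothetical helper names, and Pillow may tolerate or reject truncated data differently from Tesseract, so the result is only a candidate list.

```python
import io
from typing import Callable, Iterable, List, Tuple


def find_failing(
    items: Iterable[Tuple[str, bytes]],
    decode: Callable[[bytes], object],
) -> List[Tuple[str, str]]:
    """Return (label, error) pairs for every item whose decode raises."""
    failures = []
    for label, data in items:
        try:
            decode(data)
        except Exception as exc:  # broad on purpose: we only want a report
            failures.append((label, f"{type(exc).__name__}: {exc}"))
    return failures


def pdf_images(path: str):
    """Yield (label, bytes) for each image, using pypdf's page.images."""
    from pypdf import PdfReader  # imported lazily; requires pypdf

    reader = PdfReader(path)
    for page_number, page in enumerate(reader.pages, start=1):
        for image in page.images:
            yield f"page {page_number}, {image.name}", image.data


def pillow_full_decode(data: bytes) -> None:
    """Decode all pixel data; Pillow raises OSError on truncated files."""
    from PIL import Image  # imported lazily; requires Pillow

    with Image.open(io.BytesIO(data)) as img:
        img.load()  # Image.open is lazy, load() reads the whole image
```

A call such as `find_failing(pdf_images("some.pdf"), pillow_full_decode)` would then return the page and image name of each image that fails to decode, where `"some.pdf"` stands in for one of the attached files.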

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Apr 13, 2024