Bug when getting images from a pdf #2124

cylindrical2002 · 2023-08-28T14:50:57Z

Environment

Which environment were you using when you encountered the problem?

conda create -n bank python=3.10
conda activate bank
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
pip install hanlp
pip install transformers
pip install opencc
pip install pypdf[full]

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

reader = PdfReader("上海农村商业银行10.pdf")

number_of_pages = len(reader.pages)
print(number_of_pages)

# for page in reader.pages:
#     print(page.extract_text())
    
count = 0
    
for page in reader.pages:
    print(reader.pages.index(page))
    # print(page.extract_text())
    # if "/Annots" in page:
    #     for annot in page["/Annots"]:
    #         obj = annot.get_object()
    #         annotation = {"subtype": obj["/Subtype"], "location": obj["/Rect"]}
    #         print(annotation)
    if page.images is None:
        continue
    print(len(page.images))

    for image_file_object in page.images:
        # Ensure the image file object exists before attempting to use it
        if image_file_object is not None:
            with open(str(count) + image_file_object.name, "wb") as fp:
                fp.write(image_file_object.data)
            count += 1  # advance the filename prefix so images do not overwrite each other

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!

上海农村商业银行10.pdf

Traceback

This is the complete Traceback I see:

TypeError                                 Traceback (most recent call last)
Cell In[4], line 25
     22     continue
     23 print(len(page.images))
---> 25 for image_file_object in page.images:
     26     # Ensure the image file object exists before attempting to use it
     27     if image_file_object is not None:
     28         with open(str(count) + image_file_object.name, "wb") as fp:

File f:\Anaconda\envs\bank\lib\site-packages\pypdf\_page.py:2644, in _VirtualListImages.__iter__(self)
   2642 def __iter__(self) -> Iterator[ImageFile]:
   2643     for i in range(len(self)):
-> 2644         yield self[i]

File f:\Anaconda\envs\bank\lib\site-packages\pypdf\_page.py:2640, in _VirtualListImages.__getitem__(self, index)
   2638 if index < 0 or index >= len_self:
   2639     raise IndexError("sequence index out of range")
-> 2640 return self.get_function(lst[index])

File f:\Anaconda\envs\bank\lib\site-packages\pypdf\_page.py:544, in PageObject._get_image(self, id, obj)
    541         raise KeyError("no inline image can be found")
    542     return self.inline_images[id]
--> 544 imgd = _xobj_to_image(cast(DictionaryObject, xobjs[id]))
    545 extension, byte_stream = imgd[:2]
...
    918     ):
    919         _r = int(255 * (1 - _c / 255) * (1 - _k / 255))
    920         _g = int(255 * (1 - _m / 255) * (1 - _k / 255))

TypeError: object of type 'NoneType' has no len()

stefan6419846 · 2023-08-28T14:52:35Z

This seems to be a duplicate of #2110.

cylindrical2002 · 2023-08-28T14:53:53Z

This seems to be a duplicate of #2110.

Did you find out how to solve it?

stefan6419846 · 2023-08-28T14:56:22Z

The handling for invalid lookups has to be reworked in PyPDF itself, so there is nothing you can do from the outside (see the initial analysis in the other report). You might want to submit a corresponding PR attempting to fix this, though.
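For readers who only need the remaining images in the meantime, the following is a minimal sketch of a way to skip the failing entry. It does not fix the lookup-table handling. It relies on what the traceback shows: decoding happens when an item of `page.images` is accessed, so each access can be guarded on its own. The helper names `iter_decodable` and `save_page_images` are hypothetical and not part of pypdf, and the sketch assumes that `len(page.images)` itself succeeds, as it did in the report.

```python
from pathlib import Path
from typing import Any, Iterator, Tuple


def iter_decodable(images: Any) -> Iterator[Tuple[int, Any]]:
    """Yield (index, image) pairs, skipping entries whose decoding raises."""
    for index in range(len(images)):
        try:
            image = images[index]  # item access is where the decoding runs
        except Exception as exc:  # e.g. the TypeError from the traceback
            print(f"skipping image {index}: {exc!r}")
            continue
        yield index, image


def save_page_images(pdf_path: str, out_dir: str) -> int:
    """Write every decodable image of every page to out_dir; return the count."""
    # Imported here so that iter_decodable stays free of dependencies.
    from pypdf import PdfReader

    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = 0
    for page_number, page in enumerate(PdfReader(pdf_path).pages):
        for index, image in iter_decodable(page.images):
            # Page number and index in the name prevent collisions across pages.
            name = f"{page_number}_{index}_{image.name}"
            (target / name).write_bytes(image.data)
            saved += 1
    return saved
```

Catching `Exception` broadly is a deliberate choice for a stopgap: the goal is to keep going past any image that cannot be decoded, and the printed message still shows which one was skipped.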

cylindrical2002 · 2023-08-28T15:05:21Z

The handling for invalid lookups has to be reworked in PyPDF itself, so there is nothing you can do from the outside (see the initial analysis in the other report). You might want to submit a corresponding PR attempting to fix this, though.

I got that, thank you!
Did you find this bug in an earlier version of PyPDF?

stefan6419846 · 2023-08-28T15:09:08Z

Did you read the linked issue? You will see that I discovered this in version 3.15.2. Looking at the code, it might have been introduced in 3.10.0 with 68e2cf0#diff-185702ddcfbf2e4a9ef7106622bb77505eacae032966bba39c65ffb9cd0f9bc7

cylindrical2002 · 2023-08-28T15:14:59Z

I got your message, thank you!

pubpub-zz · 2023-08-28T20:35:07Z

@stefan6419846 The issue is the same. It is good to have public code to test with.

Didi3333 · 2024-01-16T07:03:07Z

Hi, I still see the bug with some PDFs in version 3.17.4:

Invalid Lookup Table in {'/Subtype': '/Image', '/Filter': '/FlateDecode', '/BitsPerComponent': 8, '/ColorSpace': IndirectObject(10963, 0, 2463122516880), '/Width': 79, '/Height': 82, '/Type': '/XObject'}

panda.pdf

stefan6419846 · 2024-01-16T08:02:06Z

Please open a new issue with all the required details (and an exact page number) instead of posting on old tickets.

Doing a quick test, this is another issue in your case:

ValueError: not enough image data

Thus it might be a duplicate of #2343, although it does not seem to be the flate handler which fails in your case.
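To find the exact page number requested above before opening a new issue, something along these lines may help. It is only a sketch: `find_failing_images` and `report` are hypothetical helper names, not pypdf functions, and the only pypdf behaviour assumed is that `page.images` supports `len()` and integer indexing, as the traceback in the original report shows.

```python
from typing import Any, Iterable, List, Tuple


def find_failing_images(pages: Iterable[Any]) -> List[Tuple[int, int, str]]:
    """Return (page_number, image_index, error) for each image that fails to decode.

    Page numbers are 1-based so they match what a PDF viewer displays.
    """
    failures = []
    for page_number, page in enumerate(pages, start=1):
        images = page.images
        for image_index in range(len(images)):
            try:
                images[image_index]  # item access triggers the decoding
            except Exception as exc:
                error = f"{type(exc).__name__}: {exc}"
                failures.append((page_number, image_index, error))
    return failures


def report(pdf_path: str) -> None:
    """Print one line per failing image in the given PDF."""
    # Imported here so that find_failing_images has no dependency on pypdf.
    from pypdf import PdfReader

    for page_number, image_index, error in find_failing_images(PdfReader(pdf_path).pages):
        print(f"page {page_number}, image {image_index}: {error}")
```

Calling `report("panda.pdf")` would then list the pages and errors to quote in the new issue.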

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Aug 28, 2023

BUG: fix image look-up table in EncodedStreamObject

9188157

closes py-pdf#2124 closes py-pdf#2110

pubpub-zz mentioned this issue Aug 28, 2023

BUG: Fix image look-up table in EncodedStreamObject #2128

Merged

MartinThoma closed this as completed in #2128 Sep 3, 2023

MartinThoma pushed a commit that referenced this issue Sep 3, 2023

BUG: Fix image look-up table in EncodedStreamObject (#2128)

af41173

Closes #2124 Closes #2110
