Bug when getting images from a pdf #2124

Closed
cylindrical2002 opened this issue Aug 28, 2023 · 9 comments · Fixed by #2128

Comments

@cylindrical2002

Replace this: What happened? What were you trying to achieve?

Environment

Which environment were you using when you encountered the problem?

conda create -n bank python=3.10
conda activate bank
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
pip install hanlp
pip install transformers
pip install opencc
pip install pypdf[full]

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

reader = PdfReader("上海农村商业银行10.pdf")

number_of_pages = len(reader.pages)
print(number_of_pages)

# for page in reader.pages:
#     print(page.extract_text())
    
count = 0
    
for page in reader.pages:
    print(reader.pages.index(page))
    # print(page.extract_text())
    # if "/Annots" in page:
    #     for annot in page["/Annots"]:
    #         obj = annot.get_object()
    #         annotation = {"subtype": obj["/Subtype"], "location": obj["/Rect"]}
    #         print(annotation)
    if page.images is None:
        continue
    print(len(page.images))

    for image_file_object in page.images:
        # Ensure the image file object exists before attempting to use it
        if image_file_object is not None:
            with open(str(count) + image_file_object.name, "wb") as fp:
                fp.write(image_file_object.data)
            count += 1  # advance the prefix so same-named images do not overwrite each other

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!

上海农村商业银行10.pdf

Traceback

This is the complete Traceback I see:

TypeError                                 Traceback (most recent call last)
Cell In[4], line 25
     22     continue
     23 print(len(page.images))
---> 25 for image_file_object in page.images:
     26     # Ensure the image file object exists before attempting to use it
     27     if image_file_object is not None:
     28         with open(str(count) + image_file_object.name, "wb") as fp:

File f:\Anaconda\envs\bank\lib\site-packages\pypdf\_page.py:2644, in _VirtualListImages.__iter__(self)
   2642 def __iter__(self) -> Iterator[ImageFile]:
   2643     for i in range(len(self)):
-> 2644         yield self[i]

File f:\Anaconda\envs\bank\lib\site-packages\pypdf\_page.py:2640, in _VirtualListImages.__getitem__(self, index)
   2638 if index < 0 or index >= len_self:
   2639     raise IndexError("sequence index out of range")
-> 2640 return self.get_function(lst[index])

File f:\Anaconda\envs\bank\lib\site-packages\pypdf\_page.py:544, in PageObject._get_image(self, id, obj)
    541         raise KeyError("no inline image can be found")
    542     return self.inline_images[id]
--> 544 imgd = _xobj_to_image(cast(DictionaryObject, xobjs[id]))
    545 extension, byte_stream = imgd[:2]
...
    918     ):
    919         _r = int(255 * (1 - _c / 255) * (1 - _k / 255))
    920         _g = int(255 * (1 - _m / 255) * (1 - _k / 255))

TypeError: object of type 'NoneType' has no len()
@stefan6419846
Collaborator

This seems to be a duplicate of #2110.

@cylindrical2002
Author

> This seems to be a duplicate of #2110.

Did you find out how to solve it?

@stefan6419846
Collaborator

The handling for invalid lookups has to be reworked in PyPDF itself, so there is nothing you can do from the outside (see the initial analysis in the other report). You might want to submit a corresponding PR attempting to fix this though.
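
Until the lookup handling is reworked, a caller can at least keep one bad image from aborting the whole loop. The sketch below does not repair the affected image; it only skips it and records where it failed. It assumes that `page.images` supports `len()` and integer indexing, as the `__len__` and `__getitem__` frames in the traceback suggest. The helper names `iter_images_safely` and `save_images` are hypothetical and not part of pypdf.

```python
from pathlib import Path


def iter_images_safely(images):
    """Yield (index, image, error) for each entry of a page's image list.

    Entries are fetched one index at a time, so an exception raised while
    decoding one image does not end the iteration over the others.
    """
    for index in range(len(images)):
        try:
            image = images[index]
        except Exception as exc:  # broad on purpose: TypeError, ValueError, ...
            yield index, None, exc
        else:
            yield index, image, None


def save_images(reader, out_dir):
    """Write every decodable image to out_dir and report the ones that failed.

    `reader` is expected to be a pypdf PdfReader (assumption: only
    `reader.pages`, `page.images`, `image.name` and `image.data` are used).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved, failed = [], []
    for page_number, page in enumerate(reader.pages):
        for index, image, error in iter_images_safely(page.images):
            if error is not None:
                # Keep the zero-based page number and image index for a bug report.
                failed.append((page_number, index, repr(error)))
                continue
            target = out_dir / f"{page_number}_{index}_{image.name}"
            target.write_bytes(image.data)
            saved.append(target)
    return saved, failed
```

Calling `save_images(PdfReader("file.pdf"), "out")` would then return the written paths together with a list of `(page_number, index, error)` entries, which also gives the exact page to mention when reporting the problem.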

@cylindrical2002
Author

> The handling for invalid lookups has to be reworked in PyPDF itself, so there is nothing you can do from the outside (see the initial analysis in the other report). You might want to submit a corresponding PR attempting to fix this though.

I got that, thank you!
Did you find this bug in an earlier version of PyPDF?

@stefan6419846
Collaborator

Did you read the linked issue? You will see that I discovered this in version 3.15.2. Looking at the code, it might have been introduced in 3.10.0 with 68e2cf0#diff-185702ddcfbf2e4a9ef7106622bb77505eacae032966bba39c65ffb9cd0f9bc7

@cylindrical2002
Author

I got your message, thank you!

@pubpub-zz
Collaborator

@stefan6419846 The issue is the same. It is good to have public code for testing.

@Didi3333

Didi3333 commented Jan 16, 2024

Hi, I still have the bug with some PDFs in version 3.17.4:

Invalid Lookup Table in {'/Subtype': '/Image', '/Filter': '/FlateDecode', '/BitsPerComponent': 8, '/ColorSpace': IndirectObject(10963, 0, 2463122516880), '/Width': 79, '/Height': 82, '/Type': '/XObject'}

panda.pdf

@stefan6419846
Collaborator

Please open a new issue with all the required details (and an exact page number) instead of posting on old tickets.

A quick test shows a different error in your case:

ValueError: not enough image data

Thus it might be a duplicate of #2343, although it does not seem to be the flate handler which fails in your case.
