Images contained in objects of type "/Pattern" are not retrieved #2613

@0xNath

Explanation

Hello,
First of all, thanks for your work; it's a very helpful library.

I am not able to extract images from a PDF generated with OnlyOffice:
B2.pdf

After looking into the PDF structure, it seems that the image on this page is contained inside a tiling pattern object, which neither "_page._get_ids_image" nor "_page._get_image" can handle.

I took a look at the PDF standard, and it specifies that tiling patterns can contain images, so this is not an OnlyOffice issue.

I haven't read everything the standard says about patterns yet, but once I have, I'd like to propose a change that at least makes it possible to retrieve images from them, so that getting the images of a page also considers patterns.
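To make the structure concrete, here is a minimal sketch of where such images sit. It walks a resources dictionary and yields the image XObjects found inside pattern resources. The function name `iter_pattern_images` is hypothetical, and the sketch only relies on mapping-style access (`in`, iteration, `[]`), so it runs on plain dicts; it is not pypdf's implementation.

```python
from typing import Any, Iterator, Mapping, Tuple

Path = Tuple[str, ...]


def iter_pattern_images(resources: Mapping[str, Any]) -> Iterator[Tuple[Path, Any]]:
    """Yield (path, xobject) for every image XObject stored in a pattern's resources."""
    if "/Pattern" not in resources:
        return
    patterns = resources["/Pattern"]
    for pattern_name in patterns:
        pattern = patterns[pattern_name]
        # Shading patterns have no /Resources entry, so they are skipped here.
        if "/Resources" not in pattern:
            continue
        pattern_resources = pattern["/Resources"]
        if "/XObject" not in pattern_resources:
            continue
        xobjects = pattern_resources["/XObject"]
        for xobject_name in xobjects:
            xobject = xobjects[xobject_name]
            # Keep image XObjects only; form XObjects are ignored in this sketch.
            if "/Subtype" in xobject and xobject["/Subtype"] == "/Image":
                path = ("/Pattern", pattern_name, "/Resources", "/XObject", xobject_name)
                yield path, xobject


# Hand-written stand-in for a page's /Resources dictionary (not read from B2.pdf).
resources = {
    "/XObject": {"/Im0": {"/Subtype": "/Image"}},
    "/Pattern": {
        "/P1": {
            "/PatternType": 1,
            "/Resources": {
                "/XObject": {
                    "/X1": {"/Subtype": "/Image"},
                    "/F1": {"/Subtype": "/Form"},
                }
            },
        },
        "/P2": {"/PatternType": 2, "/Shading": {}},
    },
}

for path, _ in iter_pattern_images(resources):
    print(" -> ".join(path))
```

With pypdf, the mapping passed in would presumably be the page's "/Resources" dictionary; that has not been checked against B2.pdf here.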

What do you think?

Have a nice day!

Activity

stefan6419846

stefan6419846 commented on May 1, 2024

Collaborator

Thanks for the report. To determine the images associated with a page, pypdf indeed does not consider nested XObjects for image extraction.

added the workflow-images label (from a user's perspective, image handling is the affected feature/workflow) on May 1, 2024

pubpub-zz

pubpub-zz commented on May 1, 2024

Collaborator

pypdf can look into sub-XObjects. Here, however, you are looking for an object that is part of a pattern, which in my opinion is not the right way to do things.
Here is a proposal to extract your image:

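import pypdf

r = pypdf.PdfReader("B2.pdf")
# The image is an XObject in the resources of tiling pattern /P1, not of the page itself.
xobj = r.pages[0]["/Resources"]["/Pattern"]["/P1"]["/Resources"]["/XObject"]["/X1"]
# _xobj_to_image is a private helper; index 2 of its result is the decoded image.
img = pypdf.filters._xobj_to_image(xobj)[2]
img.show()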

I will also try to propose an easier way to extract an image.
Edit: I've found a better way.

added a commit that references this issue on May 1, 2024
854c467
pubpub-zz

pubpub-zz commented on May 1, 2024

Collaborator

With the new PR, extraction will be easier:

import pypdf
r = pypdf.PdfReader("B2.pdf")
img = r.pages[0]["/Resources"]["/Pattern"]["/P1"]["/Resources"]["/XObject"]["/X1"].decode_as_image()
img.show()
0xNath

0xNath commented on May 1, 2024

Author

Wouldn't it be better for the function that is supposed to extract all images of a page to actually extract all images of the page?

The PDF standard says that images can be stored inside patterns, so we should expect to find images in them.

pubpub-zz

pubpub-zz commented on May 1, 2024

Collaborator

I agree that images can be stored in patterns, but the solution used here is not common. A pattern is expected to be used in a context, to provide an image repeated across a surface.
There are too many places where images could be (patterns, annotations, ...), so handling them all would be quite complex. Also, having the image out of its context may not be very efficient.

0xNath

0xNath commented on May 1, 2024

Author

We could add a bool parameter (recurse, deepSearch, or whatever) to the _page.images method.

When set to False, the standard methods _page._get_ids_image and _page._get_image would be called, keeping image retrieval in its simplest form: the inline images and the image dictionaries of the page.

When set to True, we could call the standard methods and return, on top of their results, the images found in "special" cases like patterns.

This way, we keep it efficient for the current usage.
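As a rough illustration of the flag idea, the sketch below lists image names from a resources dictionary and only looks into patterns when asked to. The names `list_images`, `_image_names` and `deep` are hypothetical, inline images are left out, and the code works on plain mappings rather than on pypdf's internals.

```python
from typing import Any, List, Mapping


def _image_names(resources: Mapping[str, Any], prefix: str = "") -> List[str]:
    """Names of the image XObjects listed directly in a resources dictionary."""
    if "/XObject" not in resources:
        return []
    xobjects = resources["/XObject"]
    return [
        prefix + name
        for name in xobjects
        if "/Subtype" in xobjects[name] and xobjects[name]["/Subtype"] == "/Image"
    ]


def list_images(resources: Mapping[str, Any], deep: bool = False) -> List[str]:
    """Shallow lookup by default; with deep=True, pattern resources are searched too."""
    names = _image_names(resources)
    if not deep or "/Pattern" not in resources:
        return names
    patterns = resources["/Pattern"]
    for pattern_name in patterns:
        pattern = patterns[pattern_name]
        # Only patterns that carry their own /Resources can hold XObjects.
        if "/Resources" in pattern:
            names += _image_names(pattern["/Resources"], prefix="/Pattern" + pattern_name)
    return names


# Hand-written stand-in for a page's /Resources dictionary.
resources = {
    "/XObject": {"/Im0": {"/Subtype": "/Image"}},
    "/Pattern": {
        "/P1": {"/Resources": {"/XObject": {"/X1": {"/Subtype": "/Image"}}}},
    },
}

print(list_images(resources))             # ['/Im0']
print(list_images(resources, deep=True))  # ['/Im0', '/Pattern/P1/X1']
```

The default keeps the existing behaviour and cost; callers opt in to the wider search explicitly.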

pubpub-zz

pubpub-zz commented on May 1, 2024

Collaborator

> We could add a bool parameter (recurse, deepSearch, or whatever) to the _page.images method.
>
> When set to False, the standard methods _page._get_ids_image and _page._get_image would be called, keeping image retrieval in its simplest form: the inline images and the image dictionaries of the page.
>
> When set to True, we could call the standard methods and return, on top of their results, the images found in "special" cases like patterns.
>
> This way, we keep it efficient for the current usage.

We can propose a PR

0xNath

0xNath commented on May 1, 2024

Author

Well well well, _page.images isn't a method but a property so passing a parameter to it isn't an option...
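A small sketch of that constraint, using a hypothetical `FakePage` class that is not pypdf's page object: a property is read like an attribute, so a flag would have to live on a separate method, here called `get_images` purely for illustration.

```python
from typing import List


class FakePage:
    """Hypothetical stand-in for a page object; not pypdf's PageObject."""

    def __init__(self, direct: List[str], in_patterns: List[str]) -> None:
        self._direct = direct
        self._in_patterns = in_patterns

    @property
    def images(self) -> List[str]:
        # A property is read as an attribute, so there is nowhere to pass a flag.
        return self.get_images(deep=False)

    def get_images(self, deep: bool = False) -> List[str]:
        # Hypothetical method: the flag lives here instead of on the property.
        return self._direct + (self._in_patterns if deep else [])


page = FakePage(["/Im0"], ["/Pattern/P1/X1"])
print(page.images)
print(page.get_images(deep=True))

try:
    # This calls the list returned by the property, not the property itself.
    page.images(deep=True)
except TypeError as exc:
    print("TypeError:", exc)
```

The property keeps its current meaning, and the wider search becomes an explicit call.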

changed the title [-]Image contained in objects of type "/Pattern" are not retrived[/-] [+]Images contained in objects of type "/Pattern" are not retrieved[/+] on May 2, 2024
added a commit that references this issue on Jun 9, 2024
26d1615
added a commit that references this issue on Aug 3, 2024
88d2223
        Images contained in objects of type "/Pattern" are not retrieved · Issue #2613 · py-pdf/pypdf