Is it possible to extract the order of elements in the pdf? #1581

ayusonkj · 2022-02-03T08:58:10Z

Given an overlapping series of elements such as raster images, vector drawings and text, is it possible to retrieve the order in which they are arranged in the PDF?

For example:

A vector drawing is on top of a raster image.
A raster image is behind a series of text.
A vector image is behind a text.
etc.

We first extract all the elements of a PDF page (raster images, vector drawings, text) and save them to file storage. We then want to import these extracted elements into another format, arranged the same way as in the PDF.

Is there a way for us to know which elements come first during element extraction?

Answered by JorjMcKie

JorjMcKie · 2022-02-03T10:28:56Z

Maintainer

Yes, this is possible. Not completely straightforwardly though.
For methods under my (PyMuPDF) complete control, I am returning a "seqno" (sequence number) item. This currently pertains to

Page.get_drawings()
Page.get_texttrace()

Then there is the method Page.get_bboxes(), which returns a list of all painting actions that a page performs to build its appearance. The sequence of the list items equals the sequence of the page's actions. Each item of that list is a tuple (type, rect_like), where "type" is the action type as a string like "fill-text" / "fill-image" / ..., and rect_like is the bbox of the action.
The "seqno" items mentioned above refer to the index in this list.

Other methods are based more on MuPDF's TextPage object, which unfortunately does not contain this information.
But as long as you do not sort the output, Page.get_text() should reflect the original sequence. This means that for example Page.get_text("dict") should contain the blocks (image blocks and text blocks) in the original sequence.
And - using the associated bboxes - you should be able to determine their sequence number from the get_bboxes() list.
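
A minimal sketch of how the pieces above could be combined, assuming `page` is a PyMuPDF Page object and using get_bboxlog(), the name under which the bbox list appears in the PyMuPDF documentation. The function name `paint_order` and the shape of the returned dicts are illustrative choices, not part of PyMuPDF.

```python
def paint_order(page):
    """List a page's painting actions in drawing order (bottom-most first).

    Assumption: `page` behaves like a PyMuPDF Page, i.e. it offers
    get_bboxlog(), get_drawings() and get_texttrace().
    """
    bboxlog = page.get_bboxlog()  # [(type, (x0, y0, x1, y1)), ...]

    # Collect the extracted items that carry a "seqno", keyed by that number.
    by_seqno = {}
    for path in page.get_drawings():
        by_seqno.setdefault(path["seqno"], []).append(("drawing", path))
    for span in page.get_texttrace():
        by_seqno.setdefault(span["seqno"], []).append(("text", span))

    ordered = []
    for seqno, entry in enumerate(bboxlog):
        action, bbox = entry[0], entry[1]  # tolerate extra trailing fields
        ordered.append(
            {
                "seqno": seqno,
                "action": action,
                "bbox": tuple(bbox),
                "items": by_seqno.get(seqno, []),  # may be empty, e.g. for images
            }
        )
    return ordered
```

Entries with an empty "items" list (images, for example) still need to be matched by bbox, as described above.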

7 replies

JorjMcKie Feb 8, 2022
Maintainer

Is there a bit more straightforward way to get the order information for each extracted raster image?

Unfortunately there isn't. This was one motivation for creating the bboxlog in the first place. The situation is also blurred by the fact that not all raster images are identifiable by an xref: Inline images don't have them, because they only live inside the page's /Contents. They can only be extracted as part of page.get_text() - not by doc.extract_image().

Apart and on top of all that: there may also exist annotations that cover some areas - plus they may be semi-transparent and / or may have a blendmode.
You can of course ignore or even remove annotations altogether ...
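
The bbox matching for images can be sketched as follows. This works on plain (x0, y0, x1, y1) tuples, so the image bboxes could come, for instance, from the image blocks of page.get_text("dict"). The function name `image_seqnos`, the tolerance `tol` and the `image_actions` parameter are assumptions made for illustration.

```python
def image_seqnos(image_bboxes, bboxlog, tol=0.5, image_actions=("fill-image",)):
    """Map each image bbox to the index (seqno) of a matching bboxlog entry.

    Returns a list parallel to image_bboxes; unmatched images map to None.
    Each bboxlog entry is used at most once, so an image painted twice at
    the same position gets two different sequence numbers.
    """
    used = set()
    result = []
    for img_bbox in image_bboxes:
        match = None
        for seqno, entry in enumerate(bboxlog):
            action, bbox = entry[0], entry[1]
            if seqno in used or action not in image_actions:
                continue
            # Accept the entry if all four coordinates agree within tol.
            if all(abs(a - b) <= tol for a, b in zip(img_bbox, bbox)):
                match = seqno
                used.add(seqno)
                break
        result.append(match)
    return result
```

A lower sequence number means the image is painted earlier, that is, further back on the page.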

ayusonkj Feb 10, 2022
Author

Thanks @JorjMcKie for the information, this should be plenty enough information to get the outcome that I needed.

qwertynik Jun 8, 2022

The situation is also blurred by the fact that not all raster images are identifiable by an xref: Inline images don't have them, because they only live inside the page's /Contents. They can only be extracted as part of page.get_text() - not by doc.extract_image().

@JorjMcKie
I was unable to find a way to inline images in a PDF. Can you share a sample PDF with inline images? I could not find one here either.

JorjMcKie Jun 8, 2022
Maintainer

PyMuPDF does not support inserting inline images. Extraction works as described - inline images can be recognized by xref=0.
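
As an illustration of the xref=0 rule, here is a small sketch that separates inline images from xref-based ones. It assumes a PyMuPDF version in which Page.get_image_info() accepts xrefs=True and then adds an "xref" key to each image dict; the helper name `split_inline_images` is made up for this example.

```python
def split_inline_images(page):
    """Return (referenced, inline) lists of image info dicts for a page.

    Assumption: page.get_image_info(xrefs=True) yields dicts with an
    "xref" key, where 0 means that no xref could be found for the image.
    """
    infos = page.get_image_info(xrefs=True)
    referenced = [info for info in infos if info["xref"] != 0]
    inline = [info for info in infos if info["xref"] == 0]
    return referenced, inline
```

Only the images in the first list can be passed to doc.extract_image() via their xref.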

qwertynik Jun 9, 2022

@JorjMcKie
xref=0 - that's what I was looking for. Thanks for sharing that unprompted.
I did not find a way to insert inline images using Foxit PDF Editor either. While the search continues, it would be helpful if you could share a PDF with inline images when you come across one.

yangxiaomin08 · 2023-08-15T06:57:05Z

Hi JorjMcKie,

Has the page.get_bboxes been removed? I can't find the method on this link. https://pymupdf.readthedocs.io/en/latest/page.html

1 reply

JorjMcKie Aug 15, 2023
Maintainer

Look for get_bboxlog() 😎.

benmagos · 2023-11-15T01:06:28Z

@JorjMcKie I believe this is the best place to ask a related follow-up question.

Here is the sample pdf I'm using:
samplefile2.pdf

For text, I cannot find a 100% reliable match in either get_bboxlog or get_texttrace.

Example:
bboxlog, element 8: (288.8296813964844, 138.7537841796875, 525.1096801757812, 178.39700317382812)
texttrace, element 8: (285.26922607421875, 137.66024780273438, 534.5914306640625, 177.67405700683594)
get_text, span["bbox"]: (285.2696838378906, 130.70399475097656, 534.5896606445312, 179.47779846191406)

If we first set the global fitz.TOOLS.set_small_glyph_heights(True) (see #6 in Details section),
then the last item becomes:
get_text, span["bbox"]: (285.269287109375, 137.66024780273438, 534.5897827148438, 177.674072265625)

Which is VERY close to the texttrace result above, but not an exact match the way get_drawings and images match.

Q: Is there a better way that I am missing to get order for text items?

4 replies

JorjMcKie Nov 15, 2023
Maintainer

Q: Is there a better way that I am missing to get order for text items?

No there isn't - you did very well 😎!
Maybe some more background:
When you extract via get_text() variants, then this always happens based on some TextPage object. The MuPDF code producing the textpage does a lot of work to deliver that famous blocks/lines/spans/chars hierarchy on the one hand, and on the other hand to provide abstraction from the document type, so that we always get the same type of textpage - no matter whether we are dealing with a PDF or an ebook.

In get_texttrace() I am not using a textpage, but I crawl through the text writing commands directly myself, doing the necessary coordinate computations and such. So minor deviations in some decimal places are possible. There is also none of MuPDF's magic WRT generating filler spaces and similar.

Method get_bboxlog() is similar (in not using the textpage), but I am reporting MuPDF's own bbox computation results. So we will see for example 2 separate bboxes for a filled rectangle having a border: one for the inner (filled) part and another one for the stroked (border) part. The stroke bbox always is larger than the fill one.
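
To illustrate the fill/stroke remark, the sketch below looks for such pairs in a bboxlog-like list. It rests on two assumptions that are not stated above: that the fill entry directly precedes its stroke entry, and that the action types are named "fill-path" and "stroke-path". The helper names are hypothetical.

```python
def contains(outer, inner):
    """True if bbox `outer` fully covers bbox `inner` (both x0, y0, x1, y1)."""
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and outer[2] >= inner[2]
        and outer[3] >= inner[3]
    )


def pair_fill_stroke(bboxlog):
    """Return (fill_seqno, stroke_seqno) pairs that look like one bordered shape.

    Assumption: the fill entry is immediately followed by its stroke entry,
    and the stroke bbox covers the fill bbox.
    """
    pairs = []
    for seqno in range(len(bboxlog) - 1):
        a_type, a_box = bboxlog[seqno][0], bboxlog[seqno][1]
        b_type, b_box = bboxlog[seqno + 1][0], bboxlog[seqno + 1][1]
        if a_type == "fill-path" and b_type == "stroke-path" and contains(b_box, a_box):
            pairs.append((seqno, seqno + 1))
    return pairs
```

Treating such a pair as one element may be convenient when the goal is a z-order of whole shapes rather than of individual painting actions.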

benmagos Nov 15, 2023

Thanks again for your time @JorjMcKie really helpful validation and context.

My goal is to get the highest fidelity seqno/order for each text span, but those small deviations make it a bit finicky.
I've been dreaming of a function like Rect.contains(rect), but one that is Rect.pretty_damn_close_to_equivalent(rect)

Q: I'm presuming the deviations could be in both directions? (e.g. sometimes a larger x0 from get_text() than get_texttrace(), sometimes a smaller one)

Lil pseudocode example, where I'm checking if the span["bbox"] is contained by texttrace elements or vice versa.
See here:

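    import fitz  # PyMuPDF

    def find_seqno(span, text_trace):
        """Return the seqno of the first texttrace element that contains
        the span's bbox or is contained by it, else None."""
        span_rect = fitz.Rect(span["bbox"])
        for element in text_trace:
            element_rect = fitz.Rect(element["bbox"])
            if element_rect.contains(span_rect) or span_rect.contains(element_rect):
                return element["seqno"]
        return None  # no element matched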

If I could guarantee that one of the rects provided by either of the places where seqno is available is either always larger or always smaller than the span, something like the above would be usable.
I've tried .round() and subtracting them, but sometimes it will be (0, 0, 0, 0) like I want, others (0, 0, 2, 0) or (0, -1, 0, 0) etc.

Any guidance there? 🙏

JorjMcKie Nov 16, 2023
Maintainer

You could compare the areas of the two rectangles with the area of their intersection, along these lines:

    int_area = abs(r1 & r2)  # area of the intersection rectangle
    if int_area >= 0.95 * abs(r1) and int_area >= 0.95 * abs(r2):
        print("both rectangles are pretty much equal")

Obviously choose a higher value than 95% to request "better" equality ...
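
The same idea can be turned into a small span-to-seqno matcher. The sketch below is a pure-Python restatement on plain (x0, y0, x1, y1) tuples, so it does not rely on fitz.Rect; the names `nearly_equal` and `match_seqno` are hypothetical, and `trace` is assumed to be a list of dicts with "bbox" and "seqno" keys as returned by get_texttrace().

```python
def area(r):
    """Area of an (x0, y0, x1, y1) rectangle; 0 if it is empty or inverted."""
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def overlap_score(r1, r2):
    """Intersection area divided by the larger of both areas (0.0 to 1.0)."""
    inter = (max(r1[0], r2[0]), max(r1[1], r2[1]), min(r1[2], r2[2]), min(r1[3], r2[3]))
    biggest = max(area(r1), area(r2))
    return area(inter) / biggest if biggest > 0 else 0.0


def nearly_equal(r1, r2, threshold=0.95):
    """True if the intersection covers at least `threshold` of both rectangles."""
    return overlap_score(r1, r2) >= threshold


def match_seqno(span_bbox, trace, threshold=0.95):
    """Return the seqno of the best-overlapping trace element, or None."""
    best_seqno, best_score = None, 0.0
    for element in trace:
        score = overlap_score(span_bbox, element["bbox"])
        if score >= threshold and score > best_score:
            best_seqno, best_score = element["seqno"], score
    return best_seqno
```

With the numbers quoted earlier in this thread, the span bbox obtained after set_small_glyph_heights(True) passes the 95% test against the texttrace bbox, while the default span bbox does not: the intersection covers only roughly 82% of it, because the default bbox is noticeably taller.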

benmagos Nov 16, 2023

Thank you! Will give this a shot today and see how it goes.
