-
Suppose I have a series of overlapping elements such as raster images, vector drawings and texts.
Initially, we try to extract all the elements on a PDF page (raster images, vectors, texts) and save them in file storage. After that, we want to import these extracted elements into another format, arranged the same way as they are in the PDF. Is there a way for us to know which elements come first during element extraction?
-
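Yes, this is possible, though not completely straightforward.
For methods under my (PyMuPDF) complete control, Page.get_drawings() and Page.get_texttrace(), I return a "seqno" (sequence number) item. Then there is the method Page.get_bboxes(). Other methods are based more on MuPDF's TextPage object, which unfortunately does not contain this information.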
-
Hi JorjMcKie, has page.get_bboxes been removed? I can't find the method at this link: https://pymupdf.readthedocs.io/en/latest/page.html
-
@JorjMcKie I believe this is the best place to ask a related follow-up question.
Here is the sample PDF I'm using:
For text, I cannot find a 100% reliable match in
Example: If we set the global before: fitz.TOOLS.set_small_glyph_heights(True) (see #6 in Details section)
Which is VERY close to the
Q: Is there a better way that I am missing to get the order for text items?
Yes, this is possible, though not completely straightforward.
For methods under my (PyMuPDF) complete control, I return a "seqno" (sequence number) item. This currently applies to
Page.get_drawings()
Page.get_texttrace()
Then there is the method
Page.get_bboxes()
which returns a list of all painting actions that a page performs to build its appearance. The sequence of the list items equals the sequence of the page's actions. Each item of that list is a tuple (type, rect_like),
where "type" is the action type as a string like "fill-text" / "fill-image" / ..., and rect_like is the bbox of the action. The "seqno" items mentioned above refer to the index in this list.
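To make the seqno idea concrete, here is a rough sketch of how the three outputs could be merged into one list in painting order. It works on plain Python data shaped like the results described above, so it runs without a PDF. The function name merge_by_seqno, the result keys and the sample values are made up for illustration and are not part of PyMuPDF.

```python
def merge_by_seqno(bbox_log, drawings, text_spans):
    """Return page elements in painting order (hypothetical helper).

    bbox_log:   list of (type, rect_like) tuples, one per painting action
    drawings:   list of dicts with a "seqno" key (vector graphics)
    text_spans: list of dicts with a "seqno" key (text)
    """
    # Group the detailed records by the painting action they belong to.
    by_seqno = {}
    for kind, items in (("drawing", drawings), ("text", text_spans)):
        for item in items:
            by_seqno.setdefault(item["seqno"], []).append((kind, item))

    ordered = []
    # The list index of each bbox_log entry is the sequence number.
    for seqno, (action, rect) in enumerate(bbox_log):
        matches = by_seqno.get(seqno)
        if not matches:
            # No detailed record (e.g. an image): keep type and bbox only.
            matches = [("other", None)]
        for kind, item in matches:
            ordered.append({
                "seqno": seqno,
                "action": action,
                "rect": tuple(rect),
                "kind": kind,
                "item": item,
            })
    return ordered


# Made-up sample data: an image painted first, then a path, then text.
bbox_log = [
    ("fill-image", (0, 0, 200, 100)),
    ("fill-path", (10, 10, 50, 50)),
    ("fill-text", (20, 20, 120, 35)),
]
drawings = [{"seqno": 1, "type": "f"}]
text_spans = [{"seqno": 2, "font": "Helvetica"}]

ordered = merge_by_seqno(bbox_log, drawings, text_spans)
kinds = [entry["kind"] for entry in ordered]
```

On a real page, the three inputs would presumably come from the bbox list method together with page.get_drawings() and page.get_texttrace(). As far as I can tell, recent PyMuPDF documentation lists the bbox method under the name Page.get_bboxlog(), so check the docs for the version you have installed.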
Other methods are mor…