Closed
Description
Explanation
When extracting text from rotated pages, the current options limit useful extraction in layout mode.
- If strip_rotated=True, a warning is issued and there is no output.
- If strip_rotated=False, a warning is issued and the output is garbled.
I propose to add an optional orientation: {"infer", 0, 90, 180, 270} = "infer" parameter to PageObject.extract_text. "infer" could either use page['/Rotate'] or use the actual rotation of the text. The names orientation, layout_mode_orientation, rotation, etc. are all the same to me.
I think it's best to add a keyword argument rather than to implicitly use page['/Rotate'], so one could extract different groups of rotated text from the same page. For example, a page header/footer has 0 rotation, but the page content is rotated 90 degrees. There is value in being able to extract each.
Code Example
from pypdf import PdfReader
reader = PdfReader("./rotated-page.pdf")
# all to the same effect, for a 90-degree rotated page...
reader.pages[0].extract_text(extraction_mode="layout")
reader.pages[0].extract_text(extraction_mode="layout", orientation="infer")
reader.pages[0].extract_text(extraction_mode="layout", orientation=90)
# to collect different sections of a page, while preserving the layout of each.
header = reader.pages[0].extract_text(extraction_mode="layout", orientation=0)
body = reader.pages[0].extract_text(extraction_mode="layout", orientation=90)
Activity
stefan6419846 commented on Apr 30, 2025
Thanks for the report. We already have the orientations parameter for the default "plain" mode. Adding another parameter orientation solely for the layout mode with nearly the same name sounds confusing.
I think we should evaluate doing some more or less breaking changes for the text extraction here, maybe together with refactoring the plain mode as well (see #3010). What I have in mind:
extraction_mode
mode in favor of the new method.
*args
-specific code. I do not know if this has ever been useful.
hackowitz-af commented on Apr 30, 2025
I'm super happy to help with a refactor, both for this and for #3010. What would this change look like with respect to versioning? I wouldn't want to unnecessarily force pypdf 6.
hackowitz-af commented on Apr 30, 2025
I have found a good-enough working solution to my immediate need via page.transfer_rotation_to_content(). I will make a PR to add a test, and not change any of extract_text itself.
Demonstrate that py-pdf#3270 can be adressed using existing functiona…
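A minimal sketch of the workaround mentioned above, assuming the whole page is rotated via /Rotate. The helper names extract_layout_text and extract_layout_text_from_file are made up for illustration; page.rotation, page.transfer_rotation_to_content() and extract_text(extraction_mode="layout") are the existing pypdf calls. Whether the result is clean enough will depend on the PDF.

```python
def extract_layout_text(page):
    """Return layout-mode text, baking a whole-page rotation into the content first.

    `page` is expected to behave like a pypdf PageObject. Note that
    transfer_rotation_to_content() modifies the page in place.
    """
    if page.rotation % 360 != 0:
        # Apply the /Rotate value to the page content and reset /Rotate to 0,
        # so layout mode sees the text in its upright orientation.
        page.transfer_rotation_to_content()
    return page.extract_text(extraction_mode="layout")


def extract_layout_text_from_file(path, page_index=0):
    """Hypothetical convenience wrapper around extract_layout_text."""
    # Imported here so the helper above can be used with any page-like object.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return extract_layout_text(reader.pages[page_index])
```

For the file from the original example this would be called as extract_layout_text_from_file("./rotated-page.pdf"). It does not help with pages that mix several orientations.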
stefan6419846 commented on May 1, 2025
We have a deprecation process (see developer docs) and tend to issue a new major release once a year when dropping an old Python version.
I am fine with just tackling the easy way in a PR and moving further refactoring to a dedicated issue - I will take care of this accordingly.
TST: Demonstrate that py-pdf#3270 can be resolved using existing func…
shartzog commented on May 20, 2025
Coming in a bit late, but...
@hackowitz-af, I like where your head's at. A provision to just do the rotation for you when all of the text on a page is rotated would be nice to have, but as you've already gathered, the biggest issue here is semantics. The strip_rotated parameter kinda implies that layout mode has at least some provisions for rotated text handling when in truth, it does not, at least not for a page that's rotated wholesale.
The strip_rotated parameter itself was a late addition to my original implementation, and its primary intent was to provide coverage for pages that contained text in multiple orientations, e.g. when everything is copacetic at 0 rotation except an annoying watermark (a la arXiv.org pdfs) or a clever typesetter's rotated titles on a PDF brochure, etc. Under those scenarios, a 'fixed width' algo becomes more or less impossible, so you're left to choose between ignoring the rotated text or junking up the output of the 'properly rotated' text.
I'd be happy to provide input if you wanted to try and identify a 'dominant rotation' and perform the extraction w.r.t. that orientation by default. Should come in handy for mixed PDFs that throw a rotated landscape page in the middle of 100 pages of vanilla portrait...
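One way the 'dominant rotation' idea could be sketched, purely as an illustration and not as anything pypdf currently does: take each text run together with its effective matrix [a, b, c, d, e, f], snap the matrix angle to the nearest multiple of 90 degrees, and pick the orientation that covers the most visible characters. The function names and the (text, matrix) input format are assumptions made up for this sketch.

```python
import math
from collections import Counter


def matrix_rotation(matrix):
    """Rotation of a PDF-style matrix [a, b, c, d, e, f], snapped to 0/90/180/270."""
    a, b = matrix[0], matrix[1]
    degrees = math.degrees(math.atan2(b, a))
    return (round(degrees / 90) * 90) % 360


def dominant_rotation(runs):
    """Pick the rotation that covers the most non-whitespace characters.

    `runs` is an iterable of (text, matrix) pairs; returns 0 when there is no text.
    """
    weights = Counter()
    for text, matrix in runs:
        visible = len("".join(text.split()))  # ignore whitespace when weighting
        if visible:
            weights[matrix_rotation(matrix)] += visible
    if not weights:
        return 0
    return weights.most_common(1)[0][0]
```

One possible source for such pairs is the visitor_text callback of extract_text, which is passed the text along with the current transformation and text matrices; how those two should be combined is left out here.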