Support rotated pages with extraction_mode="layout" #3270

Closed
@hackowitz-af

Description

@hackowitz-af
Contributor

Explanation

When extracting text from rotated pages in layout mode, neither of the current options produces useful output.

  • If strip_rotated=True, a warning is issued and there is no output.
  • If strip_rotated=False, a warning is issued and the output is garbled.

I propose adding an optional keyword argument orientation: {"infer", 0, 90, 180, 270} = "infer" to PageObject.extract_text. "infer" could use either page['/Rotate'] or the actual rotation of the text. I have no strong preference among the names orientation, layout_mode_orientation, rotation, etc.

I think it's best to add a keyword argument rather than to implicitly use page['/Rotate'], so that one can extract differently rotated groups of text from the same page. For example, a page header/footer may have 0 rotation while the page content is rotated 90 degrees. There is value in being able to extract each.

rotated-page.pdf

Code Example

from pypdf import PdfReader
reader = PdfReader("./rotated-page.pdf")

# all to the same effect, for a 90-degree rotated page...
reader.pages[0].extract_text(extraction_mode="layout")
reader.pages[0].extract_text(extraction_mode="layout", orientation="infer")
reader.pages[0].extract_text(extraction_mode="layout", orientation=90)

# to collect different sections of a page, while preserving the layout of each.
header = reader.pages[0].extract_text(extraction_mode="layout", orientation=0)
body = reader.pages[0].extract_text(extraction_mode="layout", orientation=90)

Activity

stefan6419846

stefan6419846 commented on Apr 30, 2025

@stefan6419846
Collaborator

Thanks for the report. We already have the orientations parameter for the default "plain" mode. Adding another parameter orientation solely for the layout mode with nearly the same name sounds confusing.

I think we should evaluate doing some more or less breaking changes for the text extraction here, maybe together with refactoring the plain mode as well (see #3010). What I have in mind:

  • Provide a new method for extracting the text in layout mode.
  • Deprecate the extraction_mode parameter in favor of the new method.
  • Clean up the parameters to have a clean interface without confusing users.
  • Get rid of the *args-specific code. I do not know if this has ever been useful.
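As a rough illustration of the first two bullets, a deprecation shim could look something like the sketch below. The class Page is a stand-in and not pypdf's PageObject, the method name extract_text_layout is hypothetical, and the sketch uses the standard warnings module rather than pypdf's own deprecation helpers.

```python
import warnings


class Page:
    """Stand-in for a page class; this is NOT pypdf's PageObject."""

    def extract_text_layout(self) -> str:
        # Hypothetical new dedicated method for layout-mode extraction.
        return "layout text"

    def _extract_text_plain(self) -> str:
        # Placeholder for the existing plain-mode extraction.
        return "plain text"

    def extract_text(self, extraction_mode: str = "plain") -> str:
        if extraction_mode == "layout":
            # Keep the old call working, but point callers at the new method.
            warnings.warn(
                "extraction_mode='layout' is deprecated; "
                "use extract_text_layout() instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            return self.extract_text_layout()
        return self._extract_text_plain()
```

Existing callers keep getting the same result during the deprecation period, and the old parameter can be removed in a later major release.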
hackowitz-af

hackowitz-af commented on Apr 30, 2025

@hackowitz-af
ContributorAuthor

I'm super happy to help with a refactor, both for this and for #3010. What would this change look like with respect to versioning? I wouldn't want to unnecessarily force pypdf 6.

hackowitz-af

hackowitz-af commented on Apr 30, 2025

@hackowitz-af
ContributorAuthor

I have found a good-enough working solution for my immediate need via page.transfer_rotation_to_content(). I will make a PR that adds a test and does not change extract_text itself.
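For anyone landing here with the same problem, the workaround can be sketched roughly as follows. The wrapper name extract_layout_text_upright is my own; it relies only on the existing transfer_rotation_to_content() and extract_text(extraction_mode="layout") calls. Note that transfer_rotation_to_content() modifies the page in place.

```python
from typing import Any


def extract_layout_text_upright(page: Any) -> str:
    """Sketch of the workaround; `page` is expected to be a pypdf PageObject.

    The helper name is illustrative and not part of pypdf.
    """
    # Apply the page's /Rotate to the content itself (modifies the page
    # in place), so the text no longer depends on the page-level rotation.
    page.transfer_rotation_to_content()
    return page.extract_text(extraction_mode="layout")


# Usage, assuming pypdf is installed and the attached file is available:
# from pypdf import PdfReader
# reader = PdfReader("./rotated-page.pdf")
# print(extract_layout_text_upright(reader.pages[0]))
```

This only helps when the whole page is rotated via /Rotate; it does not address pages that mix text in several orientations.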

added a commit that references this issue on Apr 30, 2025

Demonstrate that py-pdf#3270 can be adressed using existing functiona…

stefan6419846

stefan6419846 commented on May 1, 2025

@stefan6419846
Collaborator

> I'm super happy to help with a refactor, both for this and for #3010. What would this change look like with respect to versioning? I wouldn't want to unnecessarily force pypdf 6.

We have a deprecation process (see developer docs) and tend to issue a new major release once a year when dropping an old Python version.

I am fine with just tackling the easy way in a PR and moving further refactoring to a dedicated issue - I will take care of this accordingly.

added a commit that references this issue on May 16, 2025

TST: Demonstrate that py-pdf#3270 can be resolved using existing func…

d59164b
shartzog

shartzog commented on May 20, 2025

@shartzog
Contributor

Coming in a bit late, but...

@hackowitz-af, I like where your head's at. A provision to just do the rotation for you when all of the text on a page is rotated would be nice to have, but as you've already gathered, the biggest issue here is semantics. The strip_rotated parameter kinda implies that layout mode has at least some provisions for rotated text handling when in truth, it does not, at least not for a page that's rotated wholesale.

The strip_rotated parameter itself was a late addition to my original implementation, and its primary intent was to provide coverage for pages that contained text in multiple orientations, e.g. when everything is copacetic at 0 rotation except an annoying watermark (a la arXiv.org pdfs) or a clever typesetter's rotated titles on a PDF brochure, etc. Under those scenarios, a 'fixed width' algo becomes more or less impossible, so you're left to choose between ignoring the rotated text or junking up the output of the 'properly rotated' text.

I'd be happy to provide input if you wanted to try and identify a 'dominant rotation' and perform the extraction w.r.t. that orientation by default. Should come in handy for mixed PDFs that throw a rotated landscape page in the middle of 100 pages of vanilla portrait...
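One way the 'dominant rotation' idea could be approached is sketched below, purely as an illustration and not as layout mode's actual implementation. It assumes each text fragment is available as the first two entries (a, b) of its text matrix plus a character count; the function names and the fragment format are hypothetical. The angle is the counterclockwise rotation of the text matrix, snapped to the nearest multiple of 90 and weighted by the number of characters.

```python
import math
from collections import Counter
from typing import Iterable, Tuple


def snap_rotation(a: float, b: float) -> int:
    """Rotation implied by a text matrix [a b c d e f], snapped to 0/90/180/270."""
    angle = math.degrees(math.atan2(b, a))
    return int(round(angle / 90.0)) % 4 * 90


def dominant_rotation(fragments: Iterable[Tuple[float, float, int]]) -> int:
    """Pick the rotation that covers the most characters.

    Each fragment is (a, b, n_chars): a hypothetical simplified input format.
    Returns 0 when there are no fragments.
    """
    weights: Counter = Counter()
    for a, b, n_chars in fragments:
        weights[snap_rotation(a, b)] += n_chars
    if not weights:
        return 0
    return weights.most_common(1)[0][0]
```

Weighting by character count means a short rotated watermark does not outvote a page of upright body text, while a page rotated wholesale would still be detected as such.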


Metadata

Assignees

No one assigned

    Labels

    is-feature: A feature request
    workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Participants

      @shartzog, @stefan6419846, @hackowitz-af
