DOC: Example code doesn't give the right output (fix + proven, didn't want to create a pull request for that) #2431

@etern4l-white

Description

I tried the exact example from the documentation, but it produces blank output, even though I copied the same code and used the same PDF file. (The fix is at the bottom of this issue report.)

Environment

Debian

$ python -m platform
Linux-6.1.0-12-amd64-x86_64-with-glibc2.36

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.0.1, crypt_provider=('cryptography', '41.0.7'), PIL=10.2.0

Code + PDF

This is a minimal, complete example that shows the issue (same example from documentation):

from pypdf import PdfReader

reader = PdfReader("GeoBase_NHNC1_Data_Model_UML_EN.pdf")
page = reader.pages[3]

parts = []


def visitor_body(text, cm, tm, font_dict, font_size):
    y = cm[5]
    if y > 50 and y < 720:
        parts.append(text)


page.extract_text(visitor_text=visitor_body)
text_body = "".join(parts)

print(text_body)

Fix

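Just change cm to tm in the visitor, i.e. use y = tm[5] instead of y = cm[5]. The height used for the selection must come from the text matrix (tm), not from the current transformation matrix (cm).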
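For reference, the corrected example could look like the sketch below. Both matrices are six-element lists [a, b, c, d, e, f], so index 5 is the vertical translation. The helper names make_body_visitor and extract_body_text, and the y_min/y_max parameters, are made up for illustration and are not part of pypdf; the 50/720 bounds are taken from the original example and assume a page whose header and footer fall outside that range.

```python
def make_body_visitor(parts, y_min=50, y_max=720):
    """Return a visitor that keeps text whose text-matrix y lies between the bounds.

    Hypothetical helper for illustration; only the visitor signature
    (text, cm, tm, font_dict, font_size) comes from the documented example.
    """
    def visitor_body(text, cm, tm, font_dict, font_size):
        y = tm[5]  # vertical translation from the text matrix, not from cm
        if y_min < y < y_max:
            parts.append(text)

    return visitor_body


def extract_body_text(pdf_path, page_index=3):
    """Extract the body text of one page, skipping header and footer regions."""
    # Imported here so the visitor above can be tried without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    parts = []
    reader.pages[page_index].extract_text(visitor_text=make_body_visitor(parts))
    return "".join(parts)
```

With the PDF from this report in the working directory, print(extract_body_text("GeoBase_NHNC1_Data_Model_UML_EN.pdf")) should then print the page body instead of an empty string.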

Here's the PDF file.

Activity

stefan6419846 (Collaborator) commented on Feb 1, 2024

Do you want to submit a corresponding PR?

etern4l-white (Author) commented on Feb 1, 2024

> Do you want to submit a corresponding PR?

Is it worth a PR? I mean it's only in the documentation, not in the code. If I'm allowed to, then I'm more than open.

stefan6419846 (Collaborator) commented on Feb 1, 2024

The docs should ideally be correct, thus fixing it makes sense to avoid confusion for future readers.

etern4l-white (Author) commented on Feb 1, 2024

> The docs should ideally be correct, thus fixing it makes sense to avoid confusion for future readers.

Ok, I'll open a PR. Thanks.

etern4l-white (Author) commented on Feb 1, 2024

Hey @stefan6419846, this is my first pull request. What's the next step? I assume the maintainers will review the pull request and accept it if it meets the requirements? I'm completely new 😅

stefan6419846 (Collaborator) commented on Feb 1, 2024

No worries. Everyone did their first PR/contribution at some point in time. And apparently you already found our contribution docs which have told you about the desired PR prefixes ;)

The current maintainer (Martin) will approve the CI run for your commit in the near future to check whether anything about your change needs further attention. As soon as this has been completed, your PR will ideally be approved (maybe after some further manual checks) and then merged into our code base. The hosted docs will be rebuilt no later than the next release.

etern4l-white (Author) commented on Feb 1, 2024

Alright, that was very informative. Thanks!

    Labels

    workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow
