Open
Labels
workflow-text-extraction: From a users perspective, text extraction is the affected feature/workflow
Description
I tried the exact example mentioned here, but it produces blank output, even though I copied the same code and used the same PDF file. (The fix is at the bottom of this issue report.)
Environment
Debian
$ python -m platform
Linux-6.1.0-12-amd64-x86_64-with-glibc2.36
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.0.1, crypt_provider=('cryptography', '41.0.7'), PIL=10.2.0
Code + PDF
This is a minimal, complete example that shows the issue (same example from documentation):
from pypdf import PdfReader

reader = PdfReader("GeoBase_NHNC1_Data_Model_UML_EN.pdf")
page = reader.pages[3]

parts = []


def visitor_body(text, cm, tm, font_dict, font_size):
    y = cm[5]
    if y > 50 and y < 720:
        parts.append(text)


page.extract_text(visitor_text=visitor_body)
text_body = "".join(parts)

print(text_body)
Fix
Just change cm to tm in the line that reads y. The vertical position must be taken from the text matrix (tm), not the current transformation matrix (cm).
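For reference, a minimal sketch of the corrected example follows. The thresholds 50 and 720 and the `visitor_text` callback signature come from the snippet above; the helper names `make_body_visitor` and `extract_body` are hypothetical and only wrap the same logic so that the filter can be tried without a PDF at hand. It assumes, as in this report, that the text matrix alone carries the vertical position for this file.

```python
def make_body_visitor(parts, y_min=50, y_max=720):
    """Build a visitor_text callback that keeps text between y_min and y_max.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    def visitor_body(text, cm, tm, font_dict, font_size):
        # Read the vertical position from the text matrix (tm),
        # not from the current transformation matrix (cm).
        y = tm[5]
        if y_min < y < y_max:
            parts.append(text)

    return visitor_body


def extract_body(pdf_path, page_index):
    """Extract the body text of one page, skipping header and footer."""
    # Imported here so the visitor itself can be exercised without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    parts = []
    page = reader.pages[page_index]
    page.extract_text(visitor_text=make_body_visitor(parts))
    return "".join(parts)
```

With the PDF from this report, the call would be `print(extract_body("GeoBase_NHNC1_Data_Model_UML_EN.pdf", 3))`.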
Here's the link to the PDF file.
Activity
stefan6419846 commented on Feb 1, 2024
Do you want to submit a corresponding PR?
etern4l-white commented on Feb 1, 2024
Is it worth a PR? I mean it's only in the documentation, not in the code. If I'm allowed to, then I'm more than open.
stefan6419846 commented on Feb 1, 2024
The docs should ideally be correct, thus fixing it makes sense to avoid confusion for future readers.
etern4l-white commented on Feb 1, 2024
Ok, I'll open a PR. Thanks.
etern4l-white commented on Feb 1, 2024
Hey @stefan6419846, this is the first time I do a pull request. What's the next step? I think contributors will check that pull request and if it meets the requirements it's accepted? I'm completely new 😅
stefan6419846 commented on Feb 1, 2024
No worries. Everyone did their first PR/contribution at some point in time. And apparently you already found our contribution docs which have told you about the desired PR prefixes ;)
The current maintainer (Martin) will approve the CI run for your commit in the near future to check whether there is anything about your change which draws further attention. As soon as this has been completed, your PR will ideally be approved (maybe after some further manual checks) and then merged into our code base and trigger the rebuild of the hosted docs not later than for the next release.
etern4l-white commented on Feb 1, 2024
Alright, that was very informative. Thanks!