Spaces (that do not exist in the original PDF) appear in the output of extract_text() #2336
Comments
This is a known limitation with multiple similar issues already reported, and it is also explained in the docs: https://pypdf.readthedocs.io/en/latest/user/extract-text.html#whitespaces
TL;DR: How a text layer is retrieved depends on the actual library implementation; each tends to have its own advantages and limits. In this specific case, the
I understand. Is there any way I can work around it in pypdf? Other PDF libraries (like pymupdf, based on mupdf) don't have that problem.
You might want to have a look at the code from #2038 (comment).
## What's new

### Bug Fixes (BUG)
- Handle IndirectObject as image filter (#2355) by @stefan6419846

### Documentation (DOC)
- Quote specs in generate_file_identifiers (#2363) by @exiledkingcc
- Notes about form fields and annotations (#1945) by @dmjohnsson23
- Notes about update_page_form_field_values(auto_regenerate) (#2359) by @dmjohnsson23
- Fix stamping example (#2358) by @dmjohnsson23
- Stamp images directly on a PDF (#2357) by @dmjohnsson23
- Correct the example of adding highlight annotation (#2341) by @Tobeabellwether

### Maintenance (MAINT)
- Update upload-artifact and download-artifact actions from v3 to v4 (#2352) by @stefan6419846

### Testing (TST)
- Add xfail test for #2336 (#2365) by @MartinThoma
- Increase test coverage for flate handling of image mode 1 (#2339) by @stefan6419846

### Code Style (STY)
- File identifier generation restructuring (#2362) by @exiledkingcc
- Add PdfWriter._ID attribute (#2361) by @exiledkingcc
- Variable naming convention (#2360) by @MartinThoma

[Full Changelog](3.17.3...3.17.4)
@renanbirck
According to #2882 (comment), this has just been fixed.
I am trying to parse this PDF. However, the output of extract_text() contains a bunch of spaces that are not in the original PDF.
See the screenshot for what I mean: the original PDF is on the left, next to the output of extract_text() (e.g. "Av. Beir a Rio" should be "Av. Beira Rio", "Cen tro" should be "Centro").
If I copy/paste from Okular or another PDF reader to a text document, the text is copied correctly, so I know the PDF file is not broken.
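One way to make that comparison concrete is to check the extracted text against a reference copy (for example, the text pasted from a PDF viewer). The sketch below is only an illustration using the sample strings from this report; the helper names `differs_only_in_whitespace` and `split_words` are hypothetical and not part of pypdf.

```python
def differs_only_in_whitespace(extracted: str, reference: str) -> bool:
    """Return True if both strings are identical once all whitespace is removed."""
    def squeeze(text: str) -> str:
        return "".join(text.split())
    return squeeze(extracted) == squeeze(reference)


def split_words(extracted: str, reference: str) -> list[str]:
    """List reference words that do not appear as whole tokens in the extracted text."""
    extracted_tokens = set(extracted.split())
    return [word for word in reference.split() if word not in extracted_tokens]


# Sample strings taken from the report above.
extracted = "Av. Beir a Rio, Cen tro"
reference = "Av. Beira Rio, Centro"

print(differs_only_in_whitespace(extracted, reference))
print(split_words(extracted, reference))
```

If the first check holds, the characters themselves were extracted correctly and only the inserted spaces differ, which matches the behaviour described here.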
Environment
I am using Python 3.12 in Fedora 39.
Code + PDF
This is a minimal, complete example that shows the issue:
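A reproduction along these lines might look like the following sketch. It is not the reporter's original snippet: the function names and the PDF path passed to `main` are placeholders, and only `PdfReader`, `reader.pages` and `page.extract_text()` are actual pypdf API.

```python
def extract_text_per_page(reader):
    """Return the extracted text of every page.

    `reader` is expected to be a pypdf.PdfReader, or any object exposing
    `.pages` whose items provide `.extract_text()`.
    """
    return [page.extract_text() for page in reader.pages]


def main(pdf_path):
    # pypdf is imported lazily so the helper above stays usable without it.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    for number, text in enumerate(extract_text_per_page(reader), start=1):
        print(f"--- page {number} ---")
        print(text)
```

Calling `main` with the path to the affected PDF prints the text of each page, which can then be compared against the text shown in a PDF viewer.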