BUG: layout mode text extraction ZeroDivisionError #2417

Merged: 2 commits, Jan 21, 2024

shartzog
Contributor

For fonts without an explicitly defined width for the " " character, it's still possible to generate a ZeroDivisionError when compiling TextStateParams objects in _fixed_width_page.recurs_to_target_op() if the font size or the Tz parameter has been set to 0.

Discovered during processing of a "pre-OCR'd" image PDF having `{"/BaseFont": "/GlyphLessFont"}`.

Remove duplicate docstring for layout_mode_strip_rotated
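
As a rough illustration of the failure mode described above (not pypdf's actual code): if the width of a space is derived from the font size and the `Tz` horizontal scaling, either one being 0 makes that width 0, and dividing a gap by it raises `ZeroDivisionError`. The function names and the 250-unit default glyph width below are assumptions made for this sketch.

```python
def space_advance(font_size: float, tz_percent: float, glyph_width: float = 250.0) -> float:
    """Width of one space in text-space units (hypothetical helper).

    glyph_width is in 1/1000 em; 250 is an assumed default for fonts
    that do not define a width for " ".
    """
    return glyph_width / 1000 * font_size * tz_percent / 100


def spaces_for_gap(gap: float, font_size: float, tz_percent: float) -> int:
    """Number of spaces needed to fill a horizontal gap (hypothetical helper)."""
    advance = space_advance(font_size, tz_percent)
    if not advance:  # a font size or Tz of 0 gives a zero-width space
        return 0     # nothing sensible to divide by, so emit no spaces
    return round(gap / advance)


if __name__ == "__main__":
    print(spaces_for_gap(6.0, 12, 100))  # normal case
    print(spaces_for_gap(6.0, 0, 100))   # font size 0: guarded
    print(spaces_for_gap(6.0, 12, 0))    # Tz 0: guarded
```

Without the `if not advance` check, the last two calls would divide by `0.0` and raise.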

@shartzog
Contributor Author

Sorry for the quick patch, @MartinThoma, but we picked up a new client with "pre-OCR'd" image PDFs that contained a lot of handwritten text and this error popped up. Nothing urgent so feel free to sit on it for a bit. Just wanted to get it out there while it was top of mind.

codecov bot commented Jan 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (26b9a97) 94.42% compared to head (b460bd9) 94.43%.
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2417   +/-   ##
=======================================
  Coverage   94.42%   94.43%           
=======================================
  Files          49       49           
  Lines        8007     8008    +1     
  Branches     1616     1616           
=======================================
+ Hits         7561     7562    +1     
  Misses        276      276           
  Partials      170      170           

float 0.0 is already `falsy` and only a "true zero" float results in the ZeroDivisionError. I.e. the int() conversion isn't needed and will likely cause more harm than good.
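
The point in this commit message can be shown with a small sketch. The helper names are hypothetical, and the `int()` variant is only an assumption about what an `int()`-based check might look like, not the code that was removed.

```python
def guard_truthy(value: float, fallback: float) -> float:
    """Use the fallback only for a true zero (0.0 is falsy in Python)."""
    return value or fallback


def guard_int(value: float, fallback: float) -> float:
    """Assumed int()-based variant: truncation also zeroes small nonzero values."""
    return value if int(value) else fallback


if __name__ == "__main__":
    for width in (0.0, 0.5, 12.0):
        print(width, guard_truthy(width, 1.0), guard_int(width, 1.0))
```

A value such as `0.5` is a perfectly safe divisor, yet `int(0.5)` is `0`, so the `int()`-based check would replace it with the fallback and change the result.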
@MartinThoma
Member

MartinThoma commented Jan 20, 2024

I guess you tested the change with a private PDF that has this property?

> Sorry for the quick patch

No worries, I will never complain about any contribution that improves pypdf 😄

@shartzog
Contributor Author

> I guess you tested the change with a private PDF that has this property?

Yes, sorry. The offenders currently at my disposal all contain protected health information. I'll see if I can get our client to scan something over that doesn't. If so, I'll add a test case, but I'd put the odds of them getting back to me on that at ~50/50.

@MartinThoma MartinThoma merged commit 9e494c6 into py-pdf:main Jan 21, 2024
15 checks passed
@MartinThoma MartinThoma deleted the layout-mode-zero-div-patch branch January 21, 2024 10:56
@MartinThoma
Member

Thank you!

I've merged the change as it provides value and I trust that you have tested it. It will be released next Sunday at the latest.

Adding a test (to the sample-files repository) will ensure that we don't re-introduce this issue.

@shartzog
Contributor Author

> Thank you!
>
> I've merged the change as it provides value and I trust that you have tested it. It will be released next Sunday at the latest.
>
> Adding a test (to the sample-files repository) will ensure that we don't re-introduce this issue.

Thanks! Sounds good.

MartinThoma added a commit that referenced this pull request Jan 28, 2024
## What's new

### Bug Fixes (BUG)
-  layout mode text extraction ZeroDivisionError (#2417) by @shartzog

### Testing (TST)
-  Skip tests using fpdf2 if it's not installed (#2419) by @MartinThoma

[Full Changelog](4.0.0...4.0.1)