-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Replace decimal by float #1563
Conversation
Codecov ReportBase: 91.84% // Head: 91.90% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## main #1563 +/- ##
==========================================
+ Coverage 91.84% 91.90% +0.05%
==========================================
Files 33 33
Lines 6196 6250 +54
Branches 1229 1243 +14
==========================================
+ Hits 5691 5744 +53
Misses 326 326
- Partials 179 180 +1
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
@MartinThoma |
For OPS (operations per second) higher is better. For the rest, lower is better. Running it a second time with Side-by-side: Left is old, right is newPython 3.6Python 3.11.1Interpretation
|
Side-by-side: left old, right newPython 3.6Python 3.11Comparisonmain branch vs this branchread string from stream: mixed
merge: mixed
page operations: mixed
text extraction: your PR improved it
Python 3.6 vs 3.11 (main branch)
|
@pubpub-zz Looking at those numbers, the performance improvement seems not to be that high (e.g. compare it with #1524 ) I would only merge it if there is no risk of other issues or if it makes the code simpler |
I'm actually pretty surprised that the PR made it sometimes worse. I wonder if that is just because of randomness or if there is another effect (e.g. the new code doesn't allow the specializing interpreter to do its magic) |
was not sure it would have worked😉 Co-authored-by: Martin Thoma <info@martin-thoma.de>
@MartinThoma |
@MartinThoma / @MasterOdin |
If it fixes an issue / doesn't cause issues, I would like to merge it :-) I don' know if this causes an issue. @MasterOdin Do you know? |
This PR would fix #1376, right? |
Correct and some others shall be fixed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good from my side, but I think this has a rather high risk of unexpectedly breaking something in a silent / non-obvious way. For this reason I would feel more secure if a second reviewer gave an OK
Alternatively: How do other libraries deal with floats in PDFs?
("99900000000000000123", "99900000000000000000"), | ||
("99900000000000000123.456000", "99900000000000000000"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While it's probably fine that FloatObject
is no longer capable of such exact precision at the extreme ends, if this were to be merged, then this change should be well communicated to users to avoid any surprises.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewing code of Sumatra and Pdf.js, 64-bits floating points are used for internal storage. The rounding will be done. This should be transparent to users. I propose to add a comment in changelog and depreciation page
# ( | ||
# "928457298572093487502198745102973402987412908743.75249875981374981237498213740000", | ||
# "928457298572093487502198745102973402987412908743.7524987598137498123749821374", | ||
# ), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why comment these test cases out vs fully delete them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wondering if it worth to upgrade the test for those numbers too. your opinion ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose it'll be something could point at how precision will be lost on plugging in these ridiculous numbers into a FloatObject
. 🤷
Co-authored-by: Matthew Peveler <matt.peveler@gmail.com>
Sumatra and PDF.js are using 64bits floating numbers : they have been well coded and accept long strings. |
I want to give this another week: https://twitter.com/_martinthoma/status/1620871478827954177 It would be really nice to have a discussion board for PDF library developers 🤔 |
I vaguely remember that there is a PDF sub-format which uses PDF for engineering. This PR might be problematic for such PDFs. However, I have nothing concrete. Just the vague fear to mess something up for a small group of users. |
@MartinThoma, |
Forget my idea, test coverage / maintenance will be too difficult |
I think the current PR should be fine. I want to give people time until Sunday to mention any issues. If none come, I'll merge + release. I would NOT make a major version bump as I assume that this will not break workflows. However, I will prominently mention it in the changelog |
I haven't heard of any specific issues and multiple libraries also use float to represent non-integer numbers parsed from PDF libraries. Hence I merged it. Thanks for your good work @pubpub-zz and thank you for your patience 🙏 |
The release will be tomorrow :-) |
NOTICE: pypdf changed the way it represents numbers parsed from PDF files. pypdf<3.4.0 represented numbers as Decimal, pypdf>=3.4.0 represents them as floats. Several other PDF libraries to this, as well as many PDF viewers. We hope to fix issues with too high precision like this and get a speed boost. In case your PDF documents rely on more than 18 decimals of precision you should check if it still works as expected. To clarify: This does not affect the text shown in PDF documents. It affects numbers, e.g. when graphics are drawn on the PDF or very exact positions are used. Typically, 5 decimals should be enough. New Features (ENH) - Enable merging forms with overlapping names (#1553) - Add 'over' parameter to merge_transformend_page & co (#1567) Bug Fixes (BUG) - Fix getter of the PageObject.rotation property with an indirect object (#1602) - Restore merge_transformed_page & co (#1567) - Replace decimal by float (#1563) Robustness (ROB) - PdfWriter.remove_images: /Contents might not be in page_ref (#1598) Developer Experience (DEV) - Introduce ruff (#1586, #1609) Maintenance (MAINT) - Remove decimal (#1608) [Full Changelog](3.3.0...3.4.0)
This PR replaces Decimal by float in order to fix bugs.
It might also improve speed in some cases.
It is a preparation for #1567
Fixes #1527
Fixes #1376