BUG: Replace decimal by float #1563

Merged
merged 9 commits into from
Feb 4, 2023

Conversation

pubpub-zz
Collaborator

@pubpub-zz pubpub-zz commented Jan 18, 2023

This PR replaces Decimal by float in order to fix bugs.

It might also improve speed in some cases.
It is a preparation for #1567

Fixes #1527
Fixes #1376

@codecov

codecov bot commented Jan 18, 2023

Codecov Report

Base: 91.84% // Head: 91.90% // Increases project coverage by +0.05% 🎉

Coverage data is based on head (a4ae6e4) compared to base (8e819d1).
Patch coverage: 92.85% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1563      +/-   ##
==========================================
+ Coverage   91.84%   91.90%   +0.05%     
==========================================
  Files          33       33              
  Lines        6196     6250      +54     
  Branches     1229     1243      +14     
==========================================
+ Hits         5691     5744      +53     
  Misses        326      326              
- Partials      179      180       +1     
Impacted Files Coverage Δ
pypdf/_writer.py 84.54% <73.68%> (+0.17%) ⬆️
pypdf/_reader.py 90.41% <100.00%> (+0.32%) ⬆️
pypdf/generic/_base.py 99.65% <100.00%> (+<0.01%) ⬆️
pypdf/generic/_rectangle.py 100.00% <100.00%> (ø)

@pubpub-zz
Collaborator Author

pubpub-zz commented Jan 18, 2023

@MartinThoma
I've produced this PR as a possible way to improve speed (you had identified that FloatObject was slow). Can you try your benchmark?
PS: no rush 😉😊

@MartinThoma
Member

MartinThoma commented Jan 19, 2023

Sure!

Running pytest tests/bench.py --benchmark-json output.json:

This Branch

image

Run 2:

image

Current main

image

Run 2:

image

@MartinThoma
Member

For OPS (operations per second) higher is better. For the rest, lower is better.
The mean and the median are the interesting parts. When in doubt, I'd take the median.

Running it a second time with pytest tests/bench.py --benchmark-json output.json --benchmark-columns mean,median,rounds --benchmark-min-rounds=10

Side-by-side: Left is old, right is new

Python 3.6

image

Python 3.11.1

image

Interpretation

  • The test results are very unstable with only 10 rounds. I'll run the tests with 100 rounds again
  • In the latest run, the new version is faster in every test. But I made a couple of runs before, hence I would not trust the results
  • By far the biggest performance boost is in the hands of pypdf users: switch to Python 3.11
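
The preference for the median stated above can be illustrated with a small sketch. The timings are made-up values (not taken from the benchmark images), with one slow outlier round of the kind that makes short runs unstable:

```python
import statistics

# Hypothetical per-round timings in seconds; NOT measured pypdf data.
# The last round simulates a one-off hiccup (e.g. a busy machine).
timings = [0.101, 0.099, 0.100, 0.102, 0.098, 0.100, 0.101, 0.099, 0.100, 0.350]

mean = statistics.mean(timings)
median = statistics.median(timings)

# The single outlier pulls the mean up, while the median barely moves.
print(f"mean:   {mean:.3f} s")
print(f"median: {median:.3f} s")
```

With few rounds, one such outlier can flip a mean-based comparison between two branches, which is one reason to raise the number of rounds before trusting the result.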

@MartinThoma
Member

MartinThoma commented Jan 19, 2023

Side-by-side: left old, right new

Python 3.6

image

Python 3.11

image

Comparison

main branch vs this branch

read string from stream: mixed

  • Python 3.6: 95% of original time (your PR improved pypdf)
  • Python 3.11: 104% of original time (your PR made it worse)

merge: mixed

  • Python 3.6: 97% of original time (your PR improved pypdf)
  • Python 3.11: 100% of original time (no change)

page operations: mixed

  • Python 3.6: 93% of original time (your PR improved pypdf)
  • Python 3.11: 102% of original time (your PR made it worse)

text extraction: your PR improved it

  • Python 3.6: 93% of original time
  • Python 3.11: 97% of original time

Python 3.6 vs 3.11 (main branch)

  • read string from stream: 62% of original
  • merge: 75% of original
  • page operations: 70% of original
  • text extraction: 60% of original
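
For anyone reproducing the comparison: "N% of original time" is read here as the new time divided by the old time, so values below 100 mean the new code is faster. A minimal sketch of that arithmetic, assuming median timings as input; `percent_of_original` and the sample timings are hypothetical and not taken from the benchmark run:

```python
def percent_of_original(old_seconds: float, new_seconds: float) -> int:
    """Return the new timing as a rounded percentage of the old timing."""
    if old_seconds <= 0:
        raise ValueError("old_seconds must be positive")
    return round(100 * new_seconds / old_seconds)


# Hypothetical median timings in seconds (old branch, new branch).
faster = percent_of_original(old_seconds=2.00, new_seconds=1.90)
slower = percent_of_original(old_seconds=1.00, new_seconds=1.04)

print(f"{faster}% of original time")  # below 100: the change helped
print(f"{slower}% of original time")  # above 100: the change hurt
```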

@MartinThoma MartinThoma changed the title from "replace decimal by float" to "PI: Replace decimal by float" Jan 19, 2023
@MartinThoma MartinThoma added the nf-performance Non-functional change: Performance label Jan 19, 2023
@MartinThoma
Member

@pubpub-zz Looking at those numbers, the performance improvement does not seem that large (e.g. compare it with #1524).

I would only merge it if there is no risk of other issues or if it makes the code simpler

@MartinThoma
Member

I'm actually pretty surprised that the PR made it sometimes worse. I wonder if that is just because of randomness or if there is another effect (e.g. the new code doesn't allow the specializing interpreter to do its magic)

pypdf/generic/_base.py (outdated, resolved)
was not sure it would have worked😉

Co-authored-by: Martin Thoma <info@martin-thoma.de>
@pubpub-zz
Collaborator Author

@MartinThoma
I don't know what to do with this PR. As you said, the performance gain is not tremendous. I have no opinion about keeping it or not.
@MasterOdin any opinion?

@pubpub-zz
Collaborator Author

@MartinThoma / @MasterOdin
The PR I'm working on (merging with transformation) requires the fix to the number of digits of FloatObject. This PR could be a fix for that issue. Can you restate your position and maybe merge it? If not, we will have to produce another solution to fix the issue, which is preventing some files from being read by Acrobat Reader.

@MartinThoma
Member

If it fixes an issue / doesn't cause issues, I would like to merge it :-)

I don't know if this causes an issue.

@MasterOdin Do you know?

@MartinThoma
Member

This PR would fix #1376, right?

@pubpub-zz
Collaborator Author

Correct, and some other issues should be fixed as well.

Member

@MartinThoma MartinThoma left a comment

Looks good from my side, but I think this has a rather high risk of unexpectedly breaking something in a silent / non-obvious way. For this reason I would feel more secure if a second reviewer gave an OK

Alternatively: How do other libraries deal with floats in PDFs?

pypdf/generic/_base.py (outdated, resolved)
Comment on lines +989 to +990
("99900000000000000123", "99900000000000000000"),
("99900000000000000123.456000", "99900000000000000000"),
Member

While it's probably fine that FloatObject is no longer capable of such exact precision at the extreme ends, if this were to be merged, then this change should be well communicated to users to avoid any surprises.

Collaborator Author

Having reviewed the code of Sumatra and PDF.js: both use 64-bit floating-point numbers for internal storage. The rounding will be done. This should be transparent to users. I propose adding a note to the changelog and the deprecation page.
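
The two test cases quoted in this thread can be reproduced with the standard library alone. The sketch below does not use pypdf; it only shows, assuming the value is parsed into a 64-bit Python float, why the trailing digits are rounded away while Decimal keeps them:

```python
from decimal import Decimal

raw = "99900000000000000123"

as_decimal = Decimal(raw)  # arbitrary precision: every digit is kept
as_float = float(raw)      # IEEE 754 double: roughly 15 to 17 significant digits

print(as_decimal)          # → 99900000000000000123
print(int(as_float))       # → 99900000000000000000

# The fractional variant from the test case rounds to the same double.
same_double = float("99900000000000000123.456000") == as_float
print(same_double)         # → True
```

At this magnitude adjacent doubles are 16384 apart, so the trailing `123` (and any fractional part) cannot be represented.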

# (
# "928457298572093487502198745102973402987412908743.75249875981374981237498213740000",
# "928457298572093487502198745102973402987412908743.7524987598137498123749821374",
# ),
Member

Why comment these test cases out vs fully delete them?

Collaborator Author

I was just wondering whether it is worth updating the test for those numbers too. Your opinion?

Member

I suppose it'd be something we could point at to show how precision is lost when plugging these ridiculous numbers into a FloatObject. 🤷

Co-authored-by: Matthew Peveler <matt.peveler@gmail.com>
@pubpub-zz
Collaborator Author

> Looks good from my side, but I think this has a rather high risk of unexpectedly breaking something in a silent / non-obvious way. For this reason I would feel more secure if a second reviewer gave an OK
>
> Alternatively: How do other libraries deal with floats in PDFs?

Sumatra and PDF.js use 64-bit floating-point numbers: they are well coded and accept long strings.
Acrobat is less tolerant.

@MartinThoma
Member

I want to give this another week: https://twitter.com/_martinthoma/status/1620871478827954177

It would be really nice to have a discussion board for PDF library developers 🤔

@MartinThoma
Member

I vaguely remember that there is a PDF sub-format which uses PDF for engineering. This PR might be problematic for such PDFs.

However, I have nothing concrete. Just the vague fear to mess something up for a small group of users.

@pubpub-zz
Collaborator Author

@MartinThoma,
maybe we could use an alias type that would make it easier to move back to Decimal. Your opinion?

@pubpub-zz
Collaborator Author

> @MartinThoma,
> maybe we could use an alias type that would make it easier to move back to Decimal. Your opinion?

Forget my idea, test coverage / maintenance will be too difficult

@MartinThoma
Member

I think the current PR should be fine. I want to give people time until Sunday to mention any issues. If none come, I'll merge + release.

I would NOT make a major version bump as I assume that this will not break workflows. However, I will prominently mention it in the changelog

@MartinThoma MartinThoma changed the title from "PI: Replace decimal by float" to "BUG: Replace decimal by float" Feb 4, 2023
@MartinThoma MartinThoma merged commit 6ec88ad into py-pdf:main Feb 4, 2023
@MartinThoma
Member

I haven't heard of any specific issues and multiple libraries also use float to represent non-integer numbers parsed from PDF libraries. Hence I merged it.

Thanks for your good work @pubpub-zz and thank you for your patience 🙏

@MartinThoma
Member

The release will be tomorrow :-)

MartinThoma added a commit that referenced this pull request Feb 5, 2023
NOTICE: pypdf changed the way it represents numbers parsed from PDF files.
  pypdf<3.4.0 represented numbers as Decimal, pypdf>=3.4.0 represents them as
  floats. Several other PDF libraries do this, as well as many PDF viewers.
  We hope to fix issues with too high precision like this and get a speed boost.
  In case your PDF documents rely on more than 18 decimals of precision you
  should check if it still works as expected.
  To clarify: This does not affect the text shown in PDF documents. It affects
  numbers, e.g. when graphics are drawn on the PDF or very exact positions are
  used. Typically, 5 decimals should be enough.

New Features (ENH)
-  Enable merging forms with overlapping names (#1553)
-  Add 'over' parameter to merge_transformed_page & co (#1567)

Bug Fixes (BUG)
-  Fix getter of the PageObject.rotation property with an indirect object (#1602)
-  Restore merge_transformed_page & co (#1567)
-  Replace decimal by float (#1563)

Robustness (ROB)
-  PdfWriter.remove_images: /Contents might not be in page_ref (#1598)

Developer Experience (DEV)
-  Introduce ruff (#1586, #1609)

Maintenance (MAINT)
-  Remove decimal (#1608)

[Full Changelog](3.3.0...3.4.0)
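
To make the NOTICE in the release notes above concrete, here is a minimal sketch of writing a float back out in fixed-point notation with a bounded number of decimals. `format_pdf_number` and its `max_decimals` parameter are hypothetical and are not pypdf's actual serialization code; the default of 5 simply mirrors the "5 decimals should be enough" remark:

```python
def format_pdf_number(value: float, max_decimals: int = 5) -> str:
    """Format a float in fixed-point notation without trailing zeros.

    Hypothetical helper for illustration only; not part of pypdf.
    """
    text = f"{value:.{max_decimals}f}"
    if "." in text:
        # Drop trailing zeros, then a dangling decimal point.
        text = text.rstrip("0").rstrip(".")
    if text == "-0":
        # Tiny negative values round to zero; avoid emitting "-0".
        text = "0"
    return text


print(format_pdf_number(612.0))      # → 612
print(format_pdf_number(0.1 + 0.2))  # → 0.3
print(format_pdf_number(1e-07))      # → 0
```

Fixed-point formatting also avoids the scientific notation that `str()` produces for very small or very large floats (for example, `str(1e-07)` gives `'1e-07'`).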
@pubpub-zz pubpub-zz deleted the try_float branch June 24, 2023 08:41