BUG: Replace decimal by float #1563

pubpub-zz · 2023-01-18T20:44:26Z

This PR replaces Decimal by float in order to fix bugs.

It might also improve speed in some cases.
It is a preparation for #1567

for performance

codecov · 2023-01-18T21:56:39Z

Codecov Report

Base: 91.84% // Head: 91.90% // Increases project coverage by +0.05% 🎉

Coverage data is based on head (a4ae6e4) compared to base (8e819d1).
Patch coverage: 92.85% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1563      +/-   ##
==========================================
+ Coverage   91.84%   91.90%   +0.05%     
==========================================
  Files          33       33              
  Lines        6196     6250      +54     
  Branches     1229     1243      +14     
==========================================
+ Hits         5691     5744      +53     
  Misses        326      326              
- Partials      179      180       +1

Impacted Files	Coverage Δ
pypdf/_writer.py	`84.54% <73.68%> (+0.17%)`	⬆️
pypdf/_reader.py	`90.41% <100.00%> (+0.32%)`	⬆️
pypdf/generic/_base.py	`99.65% <100.00%> (+<0.01%)`	⬆️
pypdf/generic/_rectangle.py	`100.00% <100.00%> (ø)`

pubpub-zz · 2023-01-18T22:05:52Z

@MartinThoma
I've produced this PR as a possible way to improve speed (you had identified that FloatObject was slow). Can you run your benchmark?
PS : no rush 😉😊

MartinThoma · 2023-01-19T20:04:49Z

Sure!

Running `pytest tests/bench.py --benchmark-json output.json`:

This Branch

Run 2:

Current `main`

Run 2:

MartinThoma · 2023-01-19T20:31:50Z

For OPS (operations per second) higher is better. For the rest, lower is better.
The mean and the median are the interesting parts. When in doubt, I'd take the median.

Running it a second time with `pytest tests/bench.py --benchmark-json output.json --benchmark-columns mean,median,rounds --benchmark-min-rounds=10`

Side-by-side: Left is old, right is new

Python 3.6

Python 3.11.1

Interpretation

The test results are very unstable with only 10 rounds. I'll run the tests with 100 rounds again
In the latest run, the new version is faster in every test. But I made a couple of runs before, hence I would not trust the results
The by far biggest performance boost is in the hands of pypdf users: Switch to Python 3.11 ❗

MartinThoma · 2023-01-19T20:54:46Z

Side-by-side: left old, right new

Python 3.6

Python 3.11

Comparison

main branch vs this branch

read string from stream: mixed

Python 3.6: 95% of original time (your PR improved pypdf)
Python 3.11: 104% of original time (your PR made it worse)

merge: mixed

Python 3.6: 97% of original time (your PR improved pypdf)
Python 3.11: 100% of original time (no change)

page operations: mixed

Python 3.6: 93% of original time (your PR improved pypdf)
Python 3.11: 102% of original time (your PR made it worse)

text extraction: your PR improved it

Python 3.6: 93% of original time
Python 3.11: 97% of original time

Python 3.6 vs 3.11 (main branch)

read string from stream: 62% of original
merge: 75% of original
page operations: 70% of original
text extraction: 60% of original
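The "% of original time" figures above are ratios of the new runtime to the old one. A minimal sketch of that arithmetic follows, assuming the JSON written by `--benchmark-json` keeps a `benchmarks` list whose entries carry `name` and `stats.median`. The helper names and the sample numbers are made up for illustration; they are not results from this PR.

```python
import json
from typing import Dict


def load_report(path: str) -> dict:
    """Load a report written by `pytest --benchmark-json <path>`."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def medians(report: dict) -> Dict[str, float]:
    """Map benchmark name to median runtime (assumed pytest-benchmark layout)."""
    return {b["name"]: b["stats"]["median"] for b in report["benchmarks"]}


def relative_time(old: dict, new: dict) -> Dict[str, float]:
    """Return the new median as a percentage of the old median, per benchmark."""
    old_m, new_m = medians(old), medians(new)
    return {
        name: 100.0 * new_m[name] / old_m[name]
        for name in old_m
        if name in new_m
    }


# Made-up numbers, only to show the arithmetic (not measurements from this PR).
old = {"benchmarks": [{"name": "test_example", "stats": {"median": 2.0}}]}
new = {"benchmarks": [{"name": "test_example", "stats": {"median": 1.9}}]}

for name, pct in relative_time(old, new).items():
    # Below 100% means the new code is faster, above 100% means slower.
    print(f"{name}: {pct:.0f}% of original time")
```

With real data, `old` and `new` would come from `load_report(...)` on the two output files, one per branch.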

MartinThoma · 2023-01-19T21:51:31Z

@pubpub-zz Looking at those numbers, the performance improvement seems not to be that high (e.g. compare it with #1524 )

I would only merge it if there is no risk of other issues or if it makes the code simpler

MartinThoma · 2023-01-19T21:53:16Z

I'm actually pretty surprised that the PR made it sometimes worse. I wonder if that is just because of randomness or if there is another effect (e.g. the new code doesn't allow the specializing interpreter to do its magic)

pypdf/generic/_base.py

was not sure it would have worked😉 Co-authored-by: Martin Thoma <info@martin-thoma.de>

pubpub-zz · 2023-01-20T17:04:58Z

@MartinThoma
Don't know what to do with this PR. As you said the performance gain is not tremendous. I have no opinion about keeping it or not.
@MasterOdin any opinion?

pubpub-zz · 2023-01-28T16:37:38Z

@MartinThoma / @MasterOdin
The PR I'm working on (merging with transformation) requires the fix to the number of digits in FloatObject. This PR could fix that issue. Can you restate your position and maybe merge it? If not, we will have to come up with another solution for the issue, which is preventing some files from being read by Acrobat Reader.
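The comment above does not show the values that Acrobat Reader rejects, so the following is only an illustration of the general effect, not a reproduction of the bug. With Python's default `decimal` context (28 significant digits), the result of Decimal arithmetic can serialize to a much longer digit string than the same computation done with floats.

```python
from decimal import Decimal

# Same division, once with Decimal (default context: 28 significant digits)
# and once with a 64-bit float.
decimal_text = str(Decimal(1) / Decimal(3))
float_text = repr(1 / 3)

print(len(decimal_text), decimal_text)
print(len(float_text), float_text)
```

The float result is shorter because `repr` emits only as many digits as are needed to round-trip the 64-bit value.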

MartinThoma · 2023-01-29T12:04:44Z

If it fixes an issue / doesn't cause issues, I would like to merge it :-)

I don't know if this causes an issue.

@MasterOdin Do you know?

MartinThoma · 2023-01-29T12:08:30Z

This PR would fix #1376, right?

pubpub-zz · 2023-01-29T21:11:27Z

Correct, and some other issues should be fixed as well.

MartinThoma

Looks good from my side, but I think this has a rather high risk of unexpectedly breaking something in a silent / non-obvious way. For this reason I would feel more secure if a second reviewer gave an OK

Alternatively: How do other libraries deal with floats in PDFs?

pypdf/generic/_base.py

MasterOdin · 2023-01-30T15:03:42Z

tests/test_generic.py

+        ("99900000000000000123", "99900000000000000000"),
+        ("99900000000000000123.456000", "99900000000000000000"),


While it's probably fine that FloatObject is no longer capable of such exact precision at the extreme ends, if this were to be merged, then this change should be well communicated to users to avoid any surprises.

Reviewing the code of Sumatra and PDF.js, 64-bit floating-point numbers are used for internal storage, so the same rounding happens there. This should be transparent to users. I propose to add a note in the changelog and on the deprecation page.
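For illustration, the rounding in the two new test cases can be reproduced with plain Python, independent of pypdf. This is a minimal sketch: `format_fixed` is a hypothetical helper, not pypdf's serialization code, and the 5-decimal default is an arbitrary choice for the example.

```python
from decimal import Decimal


def format_fixed(value: float, decimals: int = 5) -> str:
    """Hypothetical helper: fixed notation, no exponent, no trailing zeros."""
    text = f"{value:.{decimals}f}"
    return text.rstrip("0").rstrip(".")


for raw in ("99900000000000000123", "99900000000000000123.456000"):
    as_decimal = Decimal(raw)  # keeps every digit of the input
    as_float = float(raw)      # 64-bit float: roughly 15 to 17 significant digits
    print(raw, "->", as_decimal, "vs", format_fixed(as_float))
```

Both inputs collapse to the same float, which is why the expected strings in the test cases are identical.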

MasterOdin · 2023-01-30T15:04:12Z

tests/test_generic.py

+        # (
+        #    "928457298572093487502198745102973402987412908743.75249875981374981237498213740000",
+        #    "928457298572093487502198745102973402987412908743.7524987598137498123749821374",
+        # ),


Why comment these test cases out vs fully delete them?

Just wondering if it is worth updating the test for those numbers too. Your opinion?

I suppose it'd be something one could point at to show how precision is lost when plugging these ridiculous numbers into a FloatObject. 🤷

Co-authored-by: Matthew Peveler <matt.peveler@gmail.com>

pubpub-zz · 2023-01-30T19:13:17Z

> Looks good from my side, but I think this has a rather high risk of unexpectedly breaking something in a silent / non-obvious way. For this reason I would feel more secure if a second reviewer gave an OK
>
> Alternatively: How do other libraries deal with floats in PDFs?

Sumatra and PDF.js use 64-bit floating-point numbers: they are well coded and accept long strings.
Acrobat is less tolerant.

MartinThoma · 2023-02-01T19:48:25Z

I want to give this another week: https://twitter.com/_martinthoma/status/1620871478827954177

It would be really nice to have a discussion board for PDF library developers 🤔

MartinThoma · 2023-02-01T19:49:56Z

I vaguely remember that there is a PDF sub-format which uses PDF for engineering. This PR might be problematic for such PDFs.

However, I have nothing concrete. Just the vague fear to mess something up for a small group of users.

pubpub-zz · 2023-02-02T18:11:44Z

@MartinThoma,
maybe we could use a type alias that would make it easier to move back to Decimal. Your opinion?

pubpub-zz · 2023-02-02T18:46:22Z

> @MartinThoma,
> maybe we could use a type alias that would make it easier to move back to Decimal. Your opinion?

Forget my idea, test coverage / maintenance will be too difficult

MartinThoma · 2023-02-02T19:54:48Z

I think the current PR should be fine. I want to give people time until Sunday to mention any issues. If none come, I'll merge + release.

I would NOT make a major version bump as I assume that this will not break workflows. However, I will prominently mention it in the changelog

MartinThoma · 2023-02-04T19:02:15Z

I haven't heard of any specific issues, and multiple libraries also use float to represent non-integer numbers parsed from PDF files. Hence I merged it.

Thanks for your good work @pubpub-zz and thank you for your patience 🙏

MartinThoma · 2023-02-04T19:02:27Z

The release will be tomorrow :-)

NOTICE: pypdf changed the way it represents numbers parsed from PDF files. pypdf<3.4.0 represented numbers as Decimal, pypdf>=3.4.0 represents them as floats. Several other PDF libraries do this, as well as many PDF viewers. We hope to fix issues with too high precision like this and get a speed boost. In case your PDF documents rely on more than 18 decimals of precision you should check if it still works as expected. To clarify: This does not affect the text shown in PDF documents. It affects numbers, e.g. when graphics are drawn on the PDF or very exact positions are used. Typically, 5 decimals should be enough.

New Features (ENH)
- Enable merging forms with overlapping names (#1553)
- Add 'over' parameter to merge_transformed_page & co (#1567)

Bug Fixes (BUG)
- Fix getter of the PageObject.rotation property with an indirect object (#1602)
- Restore merge_transformed_page & co (#1567)
- Replace decimal by float (#1563)

Robustness (ROB)
- PdfWriter.remove_images: /Contents might not be in page_ref (#1598)

Developer Experience (DEV)
- Introduce ruff (#1586, #1609)

Maintenance (MAINT)
- Remove decimal (#1608)

[Full Changelog](3.3.0...3.4.0)
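To check whether a particular number from a document is affected by the change described in the notice, one option is a round-trip test in plain Python. This is a sketch under assumptions: `survives_float` is a hypothetical helper, not part of pypdf, and it only compares numeric values, so trailing zeros in the input are ignored.

```python
from decimal import Decimal


def survives_float(text: str) -> bool:
    """Return True if the number written as `text` keeps its value as a float."""
    # repr() gives the shortest digit string that maps back to the same float,
    # so comparing it with the exact Decimal input detects any rounding.
    return Decimal(text) == Decimal(repr(float(text)))


for sample in ("612.00", "12.34567", "99900000000000000123"):
    print(sample, survives_float(sample))
```

Typical page coordinates with a handful of decimals pass this check; only values with more significant digits than a 64-bit float can hold fail it.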

pubpub-zz added 4 commits January 18, 2023 21:39

replace decimal by float

3d8f610

for performance

flake8

1748193

pytest

08fad9c

mypy

b05e1d4

MartinThoma changed the title ~~replace decimal by float~~ PI: Replace decimal by float Jan 19, 2023

MartinThoma added the nf-performance Non-functional change: Performance label Jan 19, 2023

MartinThoma reviewed Jan 19, 2023

pypdf/generic/_base.py

MartinThoma reviewed Jan 19, 2023

pypdf/generic/_base.py (outdated)

Update pypdf/generic/_base.py

e3a6ebc

was not sure it would have worked😉 Co-authored-by: Martin Thoma <info@martin-thoma.de>

Merge remote-tracking branch 'py-pdf/main' into try_float

d98d0e4

removed unused code

205131e

MartinThoma requested a review from MasterOdin January 30, 2023 09:44

MartinThoma approved these changes Jan 30, 2023

MasterOdin reviewed Jan 30, 2023

Update from @MasterOdin comment

0488c7c

Co-authored-by: Matthew Peveler <matt.peveler@gmail.com>

MartinThoma mentioned this pull request Jan 30, 2023

ENH: Restore merge_transformed_page & co + add 'over' parameter #1567

Merged

Merge branch 'main' into try_float

a4ae6e4

MartinThoma changed the title ~~PI: Replace decimal by float~~ BUG: Replace decimal by float Feb 4, 2023

MartinThoma merged commit 6ec88ad into py-pdf:main Feb 4, 2023

mrknwk mentioned this pull request Feb 5, 2023

Output PDF has loss of data #1607

Closed

pubpub-zz deleted the try_float branch June 24, 2023 08:41
