PI: Making pypdf as fast as pdfrw #2086

Merged
merged 17 commits into from
Aug 26, 2023

Conversation

Lucas-C
Member

@Lucas-C Lucas-C commented Aug 13, 2023

This is a follow-up of sarnold/pdfrw#15 (comment)

Using the code from this branch, pypdf_watermarking is as fast as pdfrw_watermarking().

pypdf was made faster by avoiding unnecessary calls to __parse_content_stream().
In order to achieve that, some calls to PageObject._push_pop_gs() were removed in PageObject._merge_page() and PageObject._merge_page_writer()
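
As a rough illustration of the idea (this is not pypdf's actual implementation), the sketch below defers tokenizing a content stream until its operations are actually requested. `LazyContentStream`, its `parse_count` counter and the whitespace-based `_parse` are hypothetical stand-ins; real content-stream parsing is considerably more involved.

```python
from typing import List, Optional


class LazyContentStream:
    """Hypothetical stand-in for a content stream; not pypdf's class."""

    def __init__(self, data: bytes) -> None:
        self._data = data
        self._operations: Optional[List[bytes]] = None
        self.parse_count = 0  # instrumentation for this example only

    def _parse(self) -> List[bytes]:
        # Toy tokenizer: a real parser must also handle strings, arrays,
        # dictionaries and inline images.
        self.parse_count += 1
        return self._data.split()

    @property
    def operations(self) -> List[bytes]:
        # Parse on first access only, then reuse the cached result.
        if self._operations is None:
            self._operations = self._parse()
        return self._operations

    def get_data(self) -> bytes:
        # If nothing was parsed, the raw bytes are returned untouched.
        if self._operations is None:
            return self._data
        return b" ".join(self._operations)


stream = LazyContentStream(b"q 1 0 0 1 0 0 cm /Xi0 Do Q")
raw = stream.get_data()           # no parsing needed for this call
tokens = stream.operations        # the stream is parsed once here
tokens_again = stream.operations  # cached, no second parse
```

With this shape, code paths that only copy or concatenate stream data never pay the parsing cost, which is the kind of saving this PR is after.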

@Lucas-C
Member Author

Lucas-C commented Aug 13, 2023

I'm surprised by the low number of failing tests:
============= 7 failed, 773 passed, 4 xfailed in 449.28s (0:07:29) ============= (Python 3.11)

This is encouraging! 😄

@Lucas-C
Member Author

Lucas-C commented Aug 14, 2023

There are only 4 remaining tests that fail. They can be executed with:
pytest -vv tests/test_workflows.py tests/test_writer.py -k 'test_extra_test_iss1541 or test_remove_images[side-by-side-subfig.pdf] or test_iss1601 or test_new_removes'

@MartinThoma
Member

MartinThoma commented Aug 14, 2023

I've just merged the comment improvements into main and updated the branch so that those don't distract from the important changes.

I also just ran https://github.com/py-pdf/benchmarks to see the impact:

  • Text extraction speed: 2.7s → 2.5s (could be random)
  • Image extraction speed: 3.3s → 3.0s (could be random)
  • Watermarking speed: 11.7s → 0.5s ❗ 🤯 😲
  • Watermarking file size: Decreased as well 🤔

@MartinThoma MartinThoma added the nf-performance Non-functional change: Performance label Aug 14, 2023
@pubpub-zz
Collaborator

@Lucas-C
I've proposed PR #1854, which allows modifying EncodedStreamData: can't you just use this capability to add the watermarking directly to the "raw" encoded stream without converting it into a ContentStream?

@Lucas-C
Member Author

Lucas-C commented Aug 14, 2023

I've proposed PR #1854, which allows modifying EncodedStreamData: can't you just use this capability to add the watermarking directly to the "raw" encoded stream without converting it into a ContentStream?

Interesting.
I don't quite follow what "conversion into ContentStream" you are referring to...
You mean in PageObject._merge_page()?

I also just ran py-pdf/benchmarks to see the impact:
Watermarking speed: 11.7s → 0.5s ❗ 🤯 😲

Yes, that's promising!

By the way, I was wondering: would you be open to adding non-regression unit tests focused on performance, @MartinThoma?
We have a couple of them in fpdf2:

They have some drawbacks (those tests will fail when executed on slow computers),
but they are a good way to ensure that performance stays good over time
for a few given scenarios.
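
For readers unfamiliar with this kind of test, here is a minimal sketch of what such a check could look like, using only the standard library. `run_timed`, `fake_watermarking` and the 5-second budget are assumptions made for illustration; they are not taken from fpdf2 or pypdf.

```python
import time
from typing import Callable


def run_timed(func: Callable[[], object], budget_seconds: float) -> float:
    """Run func once and fail if it exceeds the given time budget."""
    start = time.perf_counter()
    func()
    elapsed = time.perf_counter() - start
    assert elapsed < budget_seconds, (
        f"took {elapsed:.3f}s, budget is {budget_seconds:.3f}s"
    )
    return elapsed


def fake_watermarking() -> int:
    # Placeholder workload; a real test would call the code under test.
    return sum(range(10_000))


def test_watermarking_is_fast_enough() -> None:
    # A generous budget reduces spurious failures on slow machines,
    # at the price of only catching large regressions.
    run_timed(fake_watermarking, budget_seconds=5.0)


test_watermarking_is_fast_enough()
```

The budget is the main design choice: too tight and the test is flaky on slow computers, too loose and it only catches order-of-magnitude regressions such as the one addressed in this PR.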

@MartinThoma
Member

would you be open to adding non-regression unit tests focused on performance

Yes!

We have only two so far:

The benchmark was intended to show changes, but it seems that the machine executing it in GitHub CI behaves wildly differently. Maybe I should execute those for the releases on my machine 🤔 Or maybe there is something one can do to "normalize" the results?
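
One possible way to "normalize" timings, sketched here purely as an assumption and not as something the benchmark repository does, is to time a fixed reference workload on the same machine and report the ratio. All function names below are hypothetical.

```python
import time
from typing import Callable


def time_once(func: Callable[[], object]) -> float:
    start = time.perf_counter()
    func()
    return time.perf_counter() - start


def best_of(func: Callable[[], object], repeats: int = 5) -> float:
    # The minimum over several runs is less sensitive to noisy neighbours
    # than a single measurement or the mean.
    return min(time_once(func) for _ in range(repeats))


def reference_workload() -> int:
    # Fixed CPU-bound loop used as a yardstick for the current machine.
    total = 0
    for i in range(200_000):
        total += i * i
    return total


def normalized_score(func: Callable[[], object]) -> float:
    # Ratio of the measured time to the yardstick time on the same machine.
    baseline = best_of(reference_workload)
    return best_of(func) / baseline


score = normalized_score(lambda: sum(range(50_000)))
```

A ratio like this only compensates for overall machine speed; it does not remove noise caused by differences in I/O or memory behaviour.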

@MartinThoma
Member

@pubpub-zz I would appreciate some help here :-)

test_extra_test_iss1541

This test fails:

@pytest.mark.enable_socket()
def test_extra_test_iss1541():
    url = "https://github.com/py-pdf/pypdf/files/10418158/tst_iss1541.pdf"
    name = "tst_iss1541.pdf"
    data = BytesIO(get_data_from_url(url, name=name))
    reader = PdfReader(data, strict=False)
    reader.pages[0].extract_text()

    cs = ContentStream(reader.pages[0]["/Contents"], None, None)
    cs.operations.insert(-1, ([], b"EMC"))
    stream = BytesIO()
    cs.write_to_stream(stream)
    stream.seek(0)
    ContentStream(read_object(stream, None, None), None, None).operations

    cs = ContentStream(reader.pages[0]["/Contents"], None, None)
    cs.operations.insert(-1, ([], b"E!C"))
    stream = BytesIO()
    cs.write_to_stream(stream)
    stream.seek(0)
    with pytest.raises(PdfReadError) as exc:
        ContentStream(read_object(stream, None, None), None, None).operations
    assert exc.value.args[0] == "Unexpected end of stream"

No exception is thrown. I don't know if we would want an exception to be thrown there or if it's actually good that we no longer do it.

test_writer.py::test_remove_images

    @pytest.mark.parametrize(
        "input_path",
        [
            "side-by-side-subfig.pdf",
            "reportlab-inline-image.pdf",
        ],
    )
    def test_remove_images(pdf_file_path, input_path):
        pdf_path = RESOURCE_ROOT / input_path
    
        reader = PdfReader(pdf_path)
        writer = PdfWriter()
    
        page = reader.pages[0]
        writer.insert_page(page, 0)
        writer.remove_images()
    
        # finally, write "output" to pypdf-output.pdf
        with open(pdf_file_path, "wb") as output_stream:
            writer.write(output_stream)
    
        with open(pdf_file_path, "rb") as input_stream:
            reader = PdfReader(input_stream)
            if input_path == "side-by-side-subfig.pdf":
                extracted_text = reader.pages[0].extract_text()
>               assert "Lorem ipsum dolor sit amet" in extracted_text
E               AssertionError: assert 'Lorem ipsum dolor sit amet' in ''

That is for sure not good.

test_writer.py::test_iss1601

>       assert (
            ContentStream(reader.pages[0].get_contents(), reader).get_data()
            in page_1.get_contents().get_data()
        )

AssertionError: assert 
b'q\nq 1 0 0 1 0 0 cm /Xi0 Do Q\nQ\n\n1 0 0 1 0 0 cm  BT /F1 12 Tf 14.4 TL ET\nq\n1 0 0 1 34.24252 90.54567 cm\nq\n1 0...8 cm\nq\n0 0 0 rg\nBT 1 0 0 1 0 2.7 Tm 90.72463 0 Td /F2+0 18 Tf 20.7 TL (R M) Tj T* -90.72463 0 Td ET\nQ\nQ\nQ\n \n\n' in
b'q\n1 0.0 0.0 1 0.0 0.0 cm\n0.0 0.0 255.11811 153.07087000000001 re\nW\nn\nq\nq\n1 0 0 1 0 0 cm\n/Xi0 Do\nQ\nQ\n1 0 0...4630000000005 0 Td\n/F2+0 18 Tf\n20.699999999999999 TL\n(R\\040M) Tj\nT*\n-90.724630000000005 0 Td\nET\nQ\nQ\nQ\nQ\n\n'

The left-hand side starts with q\nq 1 0, but the right-hand side starts with q\n1 0.0. So the left has an additional Q, and the floating-point formatting of zero is different. This might be related to the file size.
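
To make the possible link between number formatting and file size concrete, here is a small sketch comparing two ways of serializing the same matrix. Neither `format_verbose` nor `format_compact` is claimed to be what pypdf does; both are hypothetical formatters chosen for illustration.

```python
from typing import Callable, Sequence


def format_verbose(value: float) -> bytes:
    # repr() of a float always keeps a fractional part, e.g. "0.0".
    return repr(float(value)).encode("ascii")


def format_compact(value: float) -> bytes:
    # Fixed precision, then drop trailing zeros and a dangling dot.
    text = f"{value:.5f}".rstrip("0").rstrip(".")
    return text.encode("ascii")


def write_cm(matrix: Sequence[float], formatter: Callable[[float], bytes]) -> bytes:
    # Serialize a transformation matrix followed by the cm operator.
    return b" ".join(formatter(v) for v in matrix) + b" cm\n"


identity = [1, 0, 0, 1, 0, 0]
verbose = write_cm(identity, format_verbose)
compact = write_cm(identity, format_compact)
saved = len(verbose) - len(compact)  # bytes saved on this single operator
```

A few bytes per operator add up quickly on pages with thousands of operators, which would be consistent with the size difference observed here, although that link has not been verified in this thread.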

test_writer.py::test_new_removes

>       assert b"/Fm0 Do" in bb
E       assert b'/Fm0 Do' in b'%PDF-1.3\n%\xe2\xe3\xcf\xd3\n1 0 obj\n<<\n/Type /Catalog\n/Pages 2 0 R\n/Outlines 46 0 R\n>>\nendobj\n2 0 obj\n<<\n/Type /Pages\n/Count 3\n/Kids [ 3 0 R 42 0 R 44 0 R ]\n>>\nendobj\n3 0 obj\n<<\n/Type /Page\n/MediaBox [ 0 0 595.3039 841.88980000000004 ]\n/Resources 4 0 R\n/Rotate 0\n/Contents 48 0 R\n/Group <<\n/CS /DeviceRGB\n/I true\n/S /Transparency\n>>\n/Parent 2 0 R\n/Annots [ 29 0 R ]\n>>\nendobj\n4 0 obj\n<<\n/Font 5 0 R\n/XObject <<\n/Fm0 18 0 R\n/Fm1 24 0 R\n/Im0 25 0 R\n/Im4 26 0 R\n>>\n/ProcSet [ /ImageB /ImageC /ImageI /PDF /Text ]\n/ExtGState <<\n/GS0 27 0 R\n/GS1 21 0 R\n>>\n/ColorSpace <<\n/CS0 19 0 R\n>>\n>>\nendobj\n5 0 obj\n<<\n/F1 6 0 R\n/F2 10 0 R\n/TT0 14 0 R\n>>\nendobj\n6 0 obj\n<<\n/Type /Font\n/Subtype /TrueType\n/BaseFont /BAAAAA+LiberationSans-Bold\n/FirstChar 0\n/LastChar 51\n/Widths [ 750 666 666 722 666 777 777 722 722 666 610 722 610 277 277 722 610 389 610 556 610 556 333 277 333 556 556 556 556 556 277 889 556 277 722 277 777 556 610 500 333 610 556 610 610 610 722 610 610 833 333 666 ]\n/FontDescriptor 7 0 R\n/ToUnicode 9 0 R\n>>\nendobj\n7 0 obj\n<<\n/Type /FontDescriptor\n/FontName /BAAAAA+LiberationSans-Bold\n/Flags 4\n/FontBBox [ -481 -376 1303 1033 ]\n/ItalicAngle 0\n/Ascent 905\n/Descent -211\n/CapHeight 1033\n/StemV 80\n/FontFile2 8 0 R\n>>\nendobj\n8 0 obj\n<<\n/Filter /FlateDecode\n/Length1 17136\n/Length 

@MartinThoma MartinThoma added the help wanted We appreciate help everywhere - this one might be an easy start! label Aug 15, 2023
@MartinThoma
Member

I've proposed PR #1854, which allows modifying EncodedStreamData: can't you just use this capability to add the watermarking directly to the "raw" encoded stream without converting it into a ContentStream?

Can you make an example PR?

I think this PR is interesting for three reasons: (1) a huge performance improvement in watermarking, (2) file size reductions, (3) test_extra_test_iss1541 no longer throws an exception... I'm not sure if that is good, though 😄

@Lucas-C
Member Author

Lucas-C commented Aug 16, 2023

Among the 4 previously failing tests, only tests/test_writer.py::test_iss1601 is still failing.

But new unit tests are failing:

FAILED tests/test_generic.py::test_encodedstream_set_data - pypdf.errors.PdfR...
FAILED tests/test_workflows.py::test_merge_output - NotImplementedError: unsu...
FAILED tests/test_workflows.py::test_image_extraction[https://corpora.tika.apache.org/base/docs/govdocs1/972/972174.pdf-tika-972174.pdf]
FAILED tests/test_writer.py::test_writer_clone_bookmarks - NotImplementedErro...

I'm going to look into them

MartinThoma added a commit that referenced this pull request Aug 19, 2023
While on it, pre-commit was also updated

Taken from #2086

Co-authored-by: Lucas Cimon <925560+Lucas-C@users.noreply.github.com>
MartinThoma added a commit that referenced this pull request Aug 19, 2023
While on it, pre-commit was also updated + several fixes for mypy.

Taken from #2086

Full credit to Lucas for the property-simplification.

Co-authored-by: Lucas Cimon <925560+Lucas-C@users.noreply.github.com>
Collaborator

@pubpub-zz pubpub-zz left a comment

first comments

@@ -818,6 +818,20 @@ def _clone(
pass
super()._clone(src, pdf_dest, force_duplicate, ignore_fields)

def get_data(self) -> bytes:
Collaborator

Suggested change
def get_data(self) -> bytes:
def get_data(self) -> bytes:

As far as I remember, _data does not have the same meaning in an EncodedStream (where _data contains the compressed data) and in a DecodedStream (where the data are in clear form). Raising set_data up into the content stream will lead people to think that set_data on an encoded stream is valid, whereas the results are not good (some side fields also need to be set).
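
The distinction can be sketched with two toy classes. These are hypothetical illustrations, not pypdf's stream classes: the point is only that an encoded stream must recompress the payload and keep side fields such as /Length in sync when its data is replaced.

```python
import zlib
from typing import Dict


class ToyDecodedStream:
    """Hypothetical: _data holds the clear (uncompressed) bytes."""

    def __init__(self, data: bytes) -> None:
        self._data = data

    def get_data(self) -> bytes:
        return self._data

    def set_data(self, data: bytes) -> None:
        self._data = data


class ToyEncodedStream:
    """Hypothetical: _data holds Flate-compressed bytes."""

    def __init__(self, clear: bytes) -> None:
        self.fields: Dict[str, object] = {"/Filter": "/FlateDecode"}
        self._data = b""
        self.set_data(clear)

    def get_data(self) -> bytes:
        # Callers expect clear bytes, so decompress on the way out.
        return zlib.decompress(self._data)

    def set_data(self, clear: bytes) -> None:
        # Storing clear bytes directly in _data would corrupt the stream:
        # they must be compressed and the /Length side field updated.
        self._data = zlib.compress(clear)
        self.fields["/Length"] = len(self._data)


encoded = ToyEncodedStream(b"BT /F1 12 Tf (Hello) Tj ET")
encoded.set_data(b"q\n" + encoded.get_data() + b"\nQ\n")
```

Writing to `_data` directly from outside the class would bypass both the compression and the /Length update, which is the kind of invalid result the comment above warns about.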

Member Author

As far as I remember, _data does not have the same meaning in an EncodedStream (where _data contains the compressed data) and in a DecodedStream (where the data are in clear form)

Interesting. Is this documented somewhere?

Accessing a private property (._data) from outside the ContentStream class seemed like a code smell to me,
but if there is a semantic meaning to using .data instead of get/set_data(),
I should take care of this.

Raising set_data up into the content stream will lead people to think that set_data on an encoded stream is valid, whereas the results are not good (some side fields also need to be set)

I do not really understand: EncodedStreamObject.set_data() existed prior to this PR,
and has been used in several places, right?
But you mention that it should not be used / is not valid?

PageObject._push_pop_gs(original_content, self.pdf)
)
# new_content_stream = PageObject._push_pop_gs(original_content, self.pdf)
new_content_array.append(original_content)
Collaborator

_push_pop_gs performs the conversion to a ContentObject in which the operators have been analysed. This is most probably the most time-consuming part of the code; however, we do not need to do that: we can just append the new code at the beginning/end of the array (possibly creating this array). Don't forget to add the extracontent to the object (it is required to be an indirect reference).

Member Author

@Lucas-C Lucas-C Aug 20, 2023

_push_pop_gs performs the conversion to a ContentObject in which the operators have been analysed. This is most probably the most time-consuming part of the code

Exactly! 😊

Don't forget to add the extracontent to the object (required to be indirect ref)

What do you mean by extracontent?
Is something missing with the current code in this PR?

Collaborator

Typo: I mean the extra content (the text added to the first page's content). What you have to remember is that the stream objects composing the content must be added to the list of objects with the _add_object() function; you then just store the indirect objects in the ArrayObject that is stored in the /Contents of the page object.
Personally, what I would do:

  • create an array object
  • copy in the existing streams; if an object is a content object, replace it with an encoded stream object (using the _replace_object() function)
  • insert "q\n" at the beginning of the first encoded stream using the set_data() function
  • insert "Q\n" at the end of the last stream
  • then do what is required for the Page2 content, appending the encoded stream
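
The steps above can be sketched on plain byte strings. This is a simplified model under stated assumptions: each `bytes` value stands for the decoded data of one stream, and the helper names are hypothetical. In pypdf the streams would additionally have to be registered as indirect objects, as described above.

```python
from typing import List


def isolate_graphics_state(streams: List[bytes]) -> List[bytes]:
    """Wrap page content in q ... Q without parsing any operator."""
    if not streams:
        return []
    wrapped = list(streams)  # copy so the caller's list is untouched
    wrapped[0] = b"q\n" + wrapped[0]
    # A leading newline keeps Q separate from the last existing operator.
    wrapped[-1] = wrapped[-1] + b"\nQ\n"
    return wrapped


def merge_contents(page1: List[bytes], page2: List[bytes]) -> List[bytes]:
    # The second page's streams are appended as-is after the isolated
    # content of the first page.
    return isolate_graphics_state(page1) + list(page2)


page = [b"1 0 0 1 0 0 cm", b"BT /F1 12 Tf (Text) Tj ET"]
watermark = [b"/Fm0 Do"]
merged = merge_contents(page, watermark)
```

Because only the first and last streams are touched, the cost is independent of how many operators the page contains.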

@Lucas-C Lucas-C force-pushed the as-fast-as-pdfrw branch 2 times, most recently from e1329d6 to 9734bf3 Compare August 20, 2023 13:50
@Lucas-C
Member Author

Lucas-C commented Aug 20, 2023

I fixed tests/test_writer.py::test_iss1601

I think this PR is ready to be reviewed, and potentially merged.

I'm also planning to take a look at improving other parts of pypdf in terms of speed,
but it may be better to do that in a separate PR.

@codecov

codecov bot commented Aug 20, 2023

Codecov Report

Patch coverage: 96.72% and project coverage change: -0.03% ⚠️

Comparison is base (4c511f9) 94.12% compared to head (123e820) 94.10%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2086      +/-   ##
==========================================
- Coverage   94.12%   94.10%   -0.03%     
==========================================
  Files          41       41              
  Lines        7442     7460      +18     
  Branches     1471     1474       +3     
==========================================
+ Hits         7005     7020      +15     
- Misses        272      274       +2     
- Partials      165      166       +1     
Files Changed                        Coverage Δ
pypdf/generic/_data_structures.py    92.40% <96.36%> (+0.03%) ⬆️
pypdf/_cmap.py                       93.65% <100.00%> (-0.75%) ⬇️
pypdf/_encryption.py                 95.92% <100.00%> (ø)
pypdf/_page.py                       93.44% <100.00%> (ø)
pypdf/_writer.py                     87.63% <100.00%> (+0.01%) ⬆️


@MartinThoma MartinThoma changed the title Draft: Making pypdf as fast as pdfrw PI: Making pypdf as fast as pdfrw Aug 20, 2023
@MartinThoma MartinThoma added soon PRs that are almost ready to be merged, issues that get solved pretty soon and removed help wanted We appreciate help everywhere - this one might be an easy start! labels Aug 20, 2023
@MartinThoma
Member

Awesome work 😲 👍 🙏

I want to give @pubpub-zz another chance to have a look at it, so I'll not merge+release it right now. But, if possible, I would love to do it next week.

Member

@MartinThoma MartinThoma left a comment

What was done

  • Remove unnecessary calls to __parse_content_stream and _push_pop_gs.
  • API changes: get_data and set_data (as well as the deprecated parts) were moved from DecodedStreamObject to StreamObject. Besides that, there are no API changes.

How this PR improves pypdf

The performance improvements for watermarking are drastic.

Performance impact of this PR:

  • Text extraction speed: 2.5s -> 2.2s
  • Image extraction speed: 3.0s -> 2.6s
  • Watermarking speed: 10.8s -> 0.4s ❗ 🤯 💥 🚀

It also reduces the watermarking file size in some cases:

Risk

I don't know where the file size reduction is coming from. My suspicion is that it's related to floating point representations, e.g. 0.0 now being 0 or similar, but I didn't check. There is a slight risk in that part.

It would be ok for me to take that risk as the resulting PDF (e.g. GeoTopo-book.pdf with watermark) looks fine and I trust our CI.

Lucas-C and others added 2 commits August 20, 2023 21:14
Removing commited code

Co-authored-by: Martin Thoma <info@martin-thoma.de>
@MartinThoma MartinThoma merged commit 0dec208 into main Aug 26, 2023
13 of 14 checks passed
@MartinThoma MartinThoma deleted the as-fast-as-pdfrw branch August 26, 2023 09:21
@MartinThoma
Member

Thank you very much for the improvement @Lucas-C 🙏

That is an awesome first contribution to pypdf 🥳 If you want, I will add you to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html :-)

@MartinThoma
Member

Next steps:

MartinThoma added a commit that referenced this pull request Aug 27, 2023
## What's new

### Performance Improvements (PI)
-  Making pypdf as fast as pdfrw (#2086)

### Maintenance (MAINT)
-  Relax typing_extensions version (#2104)

[Full Changelog](3.15.3...3.15.4)
@stefan6419846
Collaborator

stefan6419846 commented Aug 27, 2023

Interestingly, this seems to fix #2112 as well and avoids generating invalid watermarks, thus not being a performance improvement only, but a bugfix as well.

I will try to verify this tomorrow again when I am in the office. For #2112 some visual watermark test should probably be implemented anyway to avoid unexpectedly breaking this again in the future, as the correct watermark display cannot be verified by a simple file size check.

@MartinThoma
Member

MartinThoma commented Aug 27, 2023

For #2112 some visual watermark test should probably be implemented anyway to avoid unexpectedly breaking this again in the future

I agree. Interestingly, I do have such a test in a private repository. That test succeeded, but maybe I didn't use the/a broken version?

I need to check if I can transfer that test.

the correct watermark display cannot be verified by a simple file size check

👍

@stefan6419846
Collaborator

I have been able to verify that this indeed fixes the general watermarking behavior from #2112 when doing some further tests yesterday.

@MartinThoma MartinThoma removed the soon PRs that are almost ready to be merged, issues that get solved pretty soon label Sep 3, 2023