Huge file produced when merging (watermarking/stamping) #1951

ucomru · 2023-07-07T14:38:27Z

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
macOS-13.4.1-arm64-arm-64bit

$ python -c "import pypdf;print(pypdf.__version__)"
3.12.0

Explanation

Good day!

Sometimes I need to merge (append) a background into a text PDF file.
It's not simple, because pypdf.PdfReader(file).pages[0].merge_page(page) appends page on top of PdfReader(file).pages[0].
That's fine for draft information, but not a good idea for a background image.
I have to create a fresh background page, bkg = pypdf.PdfReader(f_bkg).pages[0], for every page and then merge the next text PDF page onto it.

I think it would be a good idea to add a .bg method, pypdf.PdfReader(file).pages[0].bg(background_page), that places the background beneath the text page PdfReader(file).pages[0].

The problem appears with a multipage PDF file.
When I merge the background file bg.pdf with the text file source.pdf into out_with_bg.pdf, I get a huge 132 MB file:

bg.pdf, 761 kB (contains an image, simple page)
source.pdf, 680 kB (contains text, 171 pages)
out_with_bg.pdf, 132 MB !!!

But when I use pdftk, I get only a 1.5 MB file:

 pdftk out.pdf background bg.pdf output out-BG_pdftk.pdf
out-BG_pdftk.pdf, 1.5 MB !!!

The pypdf process is too slow and produces a huge output file.

I think it would be a good idea to add .merge and .bg methods to the whole PDF object:

PdfReader(file).merge(page)
PdfReader(file).bg(page)

Code Example

How would your feature be used? (Remove this if it is not applicable.)

import pypdf

with open('source.pdf', 'rb') as f_pdf, \
        open('bg.pdf', 'rb') as f_bkg:
    pdf = pypdf.PdfReader(f_pdf) # source text pdf file
    out = pypdf.PdfWriter()
    for p, page in enumerate(pdf.pages):
        bkg = pypdf.PdfReader(f_bkg).pages[0] # background pdf file with image
        bkg.merge_page(page)
        out.add_page(bkg)
    with open('out_with_bg.pdf', 'wb') as f:
        out.write(f)


MartinThoma · 2023-07-07T17:52:17Z

@ucomru Do you have example files which we could use? I guess your real files are private, but maybe you can re-create one that is not?

As a side-note: I've added this kind of use-case to https://github.com/py-pdf/benchmarks#watermarking-speed already and noticed that pdfrw is a lot faster + produces smaller results. See also py-pdf/benchmarks#7

It's an issue I am aware of, but I'm uncertain how to tackle it.

ucomru · 2023-07-07T20:49:08Z

> @ucomru Do you have example files which we could use? I guess your real files are private, but maybe you can re-create one that is not?
>
> As a side-note: I've added this kind of use-case to https://github.com/py-pdf/benchmarks#watermarking-speed already and noticed that pdfrw is a lot faster + produces smaller results. See also py-pdf/benchmarks#7
>
> It's an issue I am aware of, but I'm uncertain how to tackle it.

Thanks for the answer!!!

I ran 2 tests, pypdf_test.py and pdftk:

❯ pypdf_test.py
Total time: 1.640826940536499 sec.

❯ time pdftk source.pdf background bg.pdf output out-BG_pdftk.pdf
pdftk source.pdf background bg.pdf output out-BG_pdftk.pdf  0.37s user 0.07s system 139% cpu 0.317 total

> exa -l
.rw-r--r--@ 721k tm  7 Jul 22:57 bg.pdf
.rw-r--r--  1.0M tm  7 Jul 23:10 out-BG_pdftk.pdf
.rw-r--r--   87M tm  7 Jul 23:00 out_with_bg.pdf
.rw-r--r--@  527 tm  7 Jul 22:59 pypdf_test.py
.rw-r--r--@ 262k tm  7 Jul 22:52 source.pdf

I think the time is not bad, but I believe a new method, .merge_page_bg, would make it faster.
I don't know the PDF structure deeply, so maybe I'm wrong... A Word docx contains a header and footer.
Maybe PDF has something similar?

pypdf_test.py:

import time

import pypdf # pip install pypdf

t = time.time()

with open('source.pdf', 'rb') as f_pdf, \
        open('bg.pdf', 'rb') as f_bkg:
    pdf = pypdf.PdfReader(f_pdf) # source text pdf file
    out = pypdf.PdfWriter()
    for p, page in enumerate(pdf.pages):
        bkg = pypdf.PdfReader(f_bkg).pages[0] # background pdf file with image
        bkg.merge_page(page)
        out.add_page(bkg)
    with open('out_with_bg.pdf', 'wb') as f:
        out.write(f)

print(f'Total time: {time.time() - t} sec.')

source.pdf -- 262 k
bg.pdf -- 721 k
out-BG_pdftk.pdf -- 1.0 M
out_with_bg.pdf -- 87 M

pubpub-zz · 2023-07-08T14:57:57Z

In your process, you add a new page based on a new copy of bg.pdf every time, so the image is duplicated many times, which explains the huge size. This is my proposal:

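import pypdf

r = pypdf.PdfReader("bg.pdf")  # read the background file once
w = pypdf.PdfWriter(clone_from="source.pdf")  # start from a copy of the text PDF

for p in w.pages:
    # over=False places the background under the existing page content
    p.merge_transformed_page(r.pages[0], pypdf.Transformation(), over=False)
    # workaround: store the merged contents as an indirect object
    p[pypdf.generic.NameObject("/Contents")] = w._add_object(p.get_contents())

w.write("out.pdf")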

The current merge_page does not offer the parameter over, which is why I used merge_transformed_page. (@MartinThoma, I think it would be great to add the parameter to merge_page, but to be consistent it should precede the existing expand. Are you OK with this?)

With this approach, the image is added only once.
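
As a rough sanity check of this explanation (not part of the original comment), the sizes reported in the first post can be compared with what one background copy per page would weigh. The sketch below assumes the output size is roughly the source plus the embedded background copies and treats 1 MB as 1000 kB; the function name `estimate_output_kb` is made up for illustration.

```python
def estimate_output_kb(source_kb: float, background_kb: float,
                       pages: int, shared: bool) -> float:
    """Rough output size: source plus one background copy (shared) or one per page."""
    copies = 1 if shared else pages
    return source_kb + copies * background_kb


# Figures from the first post: bg.pdf 761 kB, source.pdf 680 kB, 171 pages.
per_page = estimate_output_kb(680, 761, pages=171, shared=False)
shared = estimate_output_kb(680, 761, pages=171, shared=True)

print(f"one copy per page: {per_page / 1000:.1f} MB")  # pypdf output was reported as 132 MB
print(f"one shared copy:   {shared / 1000:.1f} MB")    # pdftk output was reported as 1.5 MB
```

Both estimates land close to the reported sizes, which is consistent with the image being embedded once per page in the original script.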

The replacement of the contents is needed because currently /Contents is an array of ContentStreams which are not IndirectObjects.
Note also that this leaves some unused objects in the file, increasing the size abnormally. I will propose a PR so that this is not needed in the future.

Finally, I think we should update the documentation.

MartinThoma · 2023-07-08T15:40:36Z

> The current merge_page does not offer the parameter over, which is why I used merge_transformed_page. (@MartinThoma, I think it would be great to add the parameter to merge_page, but to be consistent it should precede the existing expand. Are you OK with this?)

Do you mean something like this:

def merge_page(self, page2: "PageObject", expand: bool = False, background: PageObject) -> None:
    ...

?

pubpub-zz · 2023-07-08T16:12:42Z

> The current merge_page does not offer the parameter over, which is why I used merge_transformed_page. (@MartinThoma, I think it would be great to add the parameter to merge_page, but to be consistent it should precede the existing expand. Are you OK with this?)
>
> Do you mean something like this:
>
>     def merge_page(self, page2: "PageObject", expand: bool = False, background: PageObject) -> None:
>         ...
>
> ?

No: page2 is already the background. It would be like:

def merge_page(self, page2: "PageObject", over: bool = True, expand: bool = False) -> None:


ucomru · 2023-07-23T14:55:40Z

That's great!

Now the method page.merge_page(bg, over=False), with the magic param over, works much better.

With my sample PDF file, going from pypdf.__version__ 3.12.0 to 3.12.2 shrinks the output from 6.7 MB to 3.8 MB.

But pdftk still gets it smaller: 1.5 MB ;-)

Many Thanks!!!!!
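
For readers landing here, the `merge_page(bg, over=False)` call mentioned in this comment could be wrapped roughly as below. This is a minimal sketch, not code from the thread: the helper name `add_background` is hypothetical, and it assumes a pypdf version whose `merge_page` accepts `over=False` (3.12.2 according to this comment). The pypdf names are imported for type hints only, so the helper itself is duck-typed.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only
    from pypdf import PageObject, PdfWriter


def add_background(writer: "PdfWriter", background: "PageObject") -> "PdfWriter":
    """Draw the same background page beneath every page held by the writer."""
    for page in writer.pages:
        # over=False puts the merged content under the existing page content.
        page.merge_page(background, over=False)
    return writer


# Intended use (file names taken from the issue):
#
#     from pypdf import PdfReader, PdfWriter
#     background = PdfReader("bg.pdf").pages[0]     # read the background once
#     writer = PdfWriter(clone_from="source.pdf")   # copy of the text PDF
#     add_background(writer, background).write("out_with_bg.pdf")
```

The background page is read a single time and passed to every `merge_page` call, rather than being re-read from bg.pdf inside the loop as in the original script.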

ucomru assigned MartinThoma Jul 7, 2023

MartinThoma removed their assignment Jul 7, 2023

MartinThoma added the needs-pdf The issue needs a PDF file to show the problem label Jul 7, 2023

py-pdf deleted a comment from ucomru Jul 7, 2023

MartinThoma added the nf-performance Non-functional change: Performance label Jul 7, 2023

MartinThoma changed the title ~~pdf merge with background return is too huge file + NEED merge & bg to PdfReader(file)~~ Huge file produced when merging (watermarking/stamping) Jul 7, 2023

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Jul 8, 2023

BUG : prevent updating page contents

59a4bac

add also over param in merge_page closes py-pdf#1951

pubpub-zz mentioned this issue Jul 8, 2023

BUG: Prevent updating page contents after merging page (stamping/watermarking) #1952

Merged

MartinThoma closed this as completed in #1952 Jul 9, 2023

MartinThoma pushed a commit that referenced this issue Jul 9, 2023

BUG: Prevent updating page contents after merging page (stamping/watermarking) (#1952)

abd8342

ENH: Add the `over` parameter to `merge_page` closes #1951 closes #1953

stefan6419846 mentioned this issue Sep 4, 2023

TST: Add test for correct rendering of watermarks #2130

Merged
