Huge file produced when merging (watermarking/stamping) #1951
Comments
@ucomru Do you have example files which we could use? I guess your real files are private, but maybe you can re-create one that is not? As a side note: I've already added this kind of use case to https://github.com/py-pdf/benchmarks#watermarking-speed and noticed that pdfrw is a lot faster and produces smaller results. See also py-pdf/benchmarks#7. It's an issue I am aware of, but I'm uncertain how to tackle it.
Thanks for the answer! I ran 2 tests:
❯ pypdf_test.py
Total time: 1.640826940536499 sec.
❯ time pdftk source.pdf background bg.pdf output out-BG_pdftk.pdf
pdftk source.pdf background bg.pdf output out-BG_pdftk.pdf 0.37s user 0.07s system 139% cpu 0.317 total
> exa -l
.rw-r--r--@ 721k tm 7 Jul 22:57 bg.pdf
.rw-r--r-- 1.0M tm 7 Jul 23:10 out-BG_pdftk.pdf
.rw-r--r-- 87M tm 7 Jul 23:00 out_with_bg.pdf
.rw-r--r--@ 527 tm 7 Jul 22:59 pypdf_test.py
.rw-r--r--@ 262k tm 7 Jul 22:52 source.pdf
I think that's not a bad time, but (I think) if a new method were added -
pypdf_test.py:
import time
import pypdf  # pip install pypdf

t = time.time()
with open('source.pdf', 'rb') as f_pdf, \
        open('bg.pdf', 'rb') as f_bkg:
    pdf = pypdf.PdfReader(f_pdf)  # source text pdf file
    out = pypdf.PdfWriter()
    for p, page in enumerate(pdf.pages):
        bkg = pypdf.PdfReader(f_bkg).pages[0]  # background pdf file with image
        bkg.merge_page(page)
        out.add_page(bkg)
    with open('out_with_bg.pdf', 'wb') as f:
        out.write(f)
print(f'Total time: {time.time() - t} sec.')
source.pdf -- 262 k
In your process you add, every time, a new page based on a fresh copy of bg.pdf, so the image is duplicated many times, which explains the huge size. This is my proposal:
import pypdf

r = pypdf.PdfReader("bg.pdf")
w = pypdf.PdfWriter(clone_from="source.pdf")
for p in w.pages:
    # draw the background page under the existing page content
    p.merge_transformed_page(r.pages[0], pypdf.Transformation(), over=False)
    # store the merged content stream as an indirect object
    p[pypdf.generic.NameObject("/Contents")] = w._add_object(p.get_contents())
w.write("out.pdf")
With this approach, the image is added only once. The contents are replaced because currently /Contents is an array of ContentStreams which are not IndirectObjects. Finally, I think we should update the documentation.
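As a rough sanity check on that explanation, here is a back-of-the-envelope sketch using only the sizes from the `exa -l` listing earlier in the thread. It assumes decimal prefixes (721k = 721,000 bytes) and a deliberately crude model in which the output is the source content plus one full background per stored copy; the helper name `estimate_output_bytes` is made up for illustration and is not part of pypdf.

```python
# Sizes as printed by `exa -l` in the report (decimal prefixes assumed).
BG_BYTES = 721_000          # bg.pdf
SOURCE_BYTES = 262_000      # source.pdf
OUTPUT_BYTES = 87_000_000   # out_with_bg.pdf written by pypdf_test.py


def estimate_output_bytes(bg_copies: int) -> int:
    """Crude size model: source content plus one background per stored copy."""
    return SOURCE_BYTES + bg_copies * BG_BYTES


# How many background copies would account for the 87M output?
implied_copies = round((OUTPUT_BYTES - SOURCE_BYTES) / BG_BYTES)

print(f"implied background copies: {implied_copies}")
print(f"one copy per page: {estimate_output_bytes(implied_copies):,} bytes")
print(f"one shared copy:   {estimate_output_bytes(1):,} bytes")
```

Under these assumptions the 87M file corresponds to roughly 120 stored copies of the background, while a single shared copy gives about 983,000 bytes, which is in line with the 1.0M file that pdftk produced.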
Do you mean something like this:
?
No: page2 is already the background. It would be like:
add also over param in merge_page closes py-pdf#1951
It's great! Now method
But many thanks!
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
macOS-13.4.1-arm64-arm-64bit
$ python -c "import pypdf;print(pypdf.__version__)"
3.12.0
Explanation
Good day!
It's not simple, because pypdf.PdfReader(file).pages[0].merge_page(page) appends page on top of PdfReader(file).pages[0]. That is good for draft information, but it is not a good idea for a background image: a new background page has to be created each time, bkg = pypdf.PdfReader(f_bkg).pages[0], and then merged with the next text PDF page.
I think it's a good idea to add a method .bg: pypdf.PdfReader(file).pages[0].bg(background_page) would put background under the text page PdfReader(file).pages[0].
When I merge the background file bg.pdf with the text file source.pdf into out_with_bg.pdf, I get a very large 132 MB file:
but when I use pdftk, I get only a 1.5 MB file:
pdftk out.pdf background bg.pdf output out-BG_pdftk.pdf
out-BG_pdftk.pdf, 1.5 MB !!!
It's too slow a process, with a huge output file.
I think it's a good idea to add methods .merge and .bg to the whole PDF object:
Code Example
How would your feature be used? (Remove this if it is not applicable.)