Merging multiple PDFs appears to duplicate resources per page #239

mikeokner · 2015-12-17T22:30:14Z

I'm using PyPDF2 to merge thousands of individual PDFs (generated by WeasyPrint) into a single PDF. The resulting PDF looks fine with all of the pages merged, but the resulting file is quite large. I believe this is due to fonts/images being unnecessarily duplicated per-page.

>>> from PyPDF2 import PdfFileReader
>>> pdf = PdfFileReader(f)
>>> p0 = pdf.getPage(0)
>>> p1 = pdf.getPage(1)
>>> p0['/Resources']
{'/Font': {'/f-0-0': IndirectObject(19, 0), '/f-1-0': IndirectObject(15, 0), '/f-2-0': IndirectObject(23, 0)}, '/XObject': {'/x7': IndirectObject(28, 0), '/x5': IndirectObject(27, 0)}, '/ExtGState': {'/a0': {'/ca': 1, '/CA': 1}}}
>>> p1['/Resources']
{'/Font': {'/f-0-0': IndirectObject(41, 0), '/f-1-0': IndirectObject(37, 0), '/f-2-0': IndirectObject(45, 0)}, '/XObject': {'/x7': IndirectObject(50, 0), '/x5': IndirectObject(49, 0)}, '/ExtGState': {'/a0': {'/ca': 1, '/CA': 1}}}

For /Font and at least one of the /XObject images, I believe the IndirectObjects should all point to the same object ID as they are identical in the initial separate PDFs.
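One way to confirm the suspicion is to compare the raw content of the objects behind those references and see which IDs carry identical bytes. The following is only a minimal sketch: `find_duplicate_objects` is a hypothetical helper, not part of PyPDF2, and it assumes you have already extracted each object's bytes into a plain dict keyed by object ID (how you obtain those bytes from the reader is left open here).

```python
import hashlib
from collections import defaultdict


def find_duplicate_objects(objects):
    """Group object IDs whose raw bytes are identical.

    `objects` is a hypothetical mapping of PDF object ID -> raw bytes of
    that object (for example, an embedded font or image stream).
    Returns a sorted list of ID groups; each group has two or more IDs.
    """
    by_digest = defaultdict(list)
    for obj_id, payload in objects.items():
        # Identical payloads produce identical digests.
        digest = hashlib.sha256(payload).hexdigest()
        by_digest[digest].append(obj_id)
    # Keep only digests shared by more than one object ID.
    return sorted(sorted(ids) for ids in by_digest.values() if len(ids) > 1)


# Illustrative data only: the IDs mirror the output above, the bytes are made up.
sample = {19: b"font-a", 41: b"font-a", 15: b"font-b", 37: b"font-b", 28: b"img"}
print(find_duplicate_objects(sample))
```

If the font objects on page 0 and page 1 end up in the same group, the merged file is storing the same resource more than once.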


jerome-nexedi · 2016-07-22T03:12:54Z

In a similar situation with lots of duplicated images, a workaround that worked for me was to post-process the PDF with Ghostscript using -dDetectDuplicateImages=true, as suggested on http://stackoverflow.com/questions/10450120/optimize-pdf-files-with-ghostscript-or-other#answer-10453202
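For anyone scripting this workaround, here is a rough sketch of calling Ghostscript from Python. It assumes a `gs` executable is on the PATH and uses the `pdfwrite` device; the helper names and the file names in the usage note are hypothetical, and only `-dDetectDuplicateImages=true` comes from the comment above.

```python
import subprocess


def build_gs_dedupe_command(src, dst, gs_binary="gs"):
    """Build a Ghostscript command that rewrites `src` into `dst`.

    The other switches are the usual ones for a non-interactive
    pdfwrite run; adjust them to your setup.
    """
    return [
        gs_binary,
        "-sDEVICE=pdfwrite",             # write a new PDF
        "-dDetectDuplicateImages=true",  # flag suggested in the comment above
        "-dNOPAUSE",                     # do not pause between pages
        "-dBATCH",                       # exit when processing is done
        f"-sOutputFile={dst}",
        src,
    ]


def dedupe_images(src, dst):
    """Run Ghostscript; raises CalledProcessError if gs exits non-zero."""
    subprocess.run(build_gs_dedupe_command(src, dst), check=True)
```

Usage would look like `dedupe_images("merged.pdf", "merged-small.pdf")`, run after the PyPDF2 merge has finished.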

MartinThoma · 2022-06-30T05:42:45Z

I think #207 might have changed the situation. Can you check again?

MartinThoma · 2022-06-30T05:43:06Z

Without having the PDF that has duplications, I cannot check myself.

MartinThoma added the is-bug label ("From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF") on Apr 8, 2022

MartinThoma added the nf-performance label ("Non-functional change: Performance") on Apr 22, 2022

MartinThoma added the needs-pdf label ("The issue needs a PDF file to show the problem") on Jun 30, 2022

MartinThoma closed this as completed Aug 6, 2022
