Merging multiple PDFs appears to duplicate resources per page #239

Closed
mikeokner opened this issue Dec 17, 2015 · 3 comments
Labels: is-bug, needs-pdf, nf-performance

Comments

@mikeokner

I'm using PyPDF2 to merge thousands of individual PDFs (generated by WeasyPrint) into a single PDF. The merged PDF looks fine and contains every page, but the file is much larger than expected. I believe this is because fonts and images are duplicated for every page instead of being shared.

>>> from PyPDF2 import PdfFileReader
>>> pdf = PdfFileReader(f)
>>> p0 = pdf.getPage(0)
>>> p1 = pdf.getPage(1)
>>> p0['/Resources']
{'/Font': {'/f-0-0': IndirectObject(19, 0), '/f-1-0': IndirectObject(15, 0), '/f-2-0': IndirectObject(23, 0)}, '/XObject': {'/x7': IndirectObject(28, 0), '/x5': IndirectObject(27, 0)}, '/ExtGState': {'/a0': {'/ca': 1, '/CA': 1}}}
>>> p1['/Resources']
{'/Font': {'/f-0-0': IndirectObject(41, 0), '/f-1-0': IndirectObject(37, 0), '/f-2-0': IndirectObject(45, 0)}, '/XObject': {'/x7': IndirectObject(50, 0), '/x5': IndirectObject(49, 0)}, '/ExtGState': {'/a0': {'/ca': 1, '/CA': 1}}}

For /Font and at least one of the /XObject images, I believe the IndirectObjects on every page should point to the same object ID, since those resources are identical in the original separate PDFs.
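
To make the expectation concrete, here is a minimal sketch of the idea: group objects by a hash of their content, so that identical fonts or images could share one object ID. It is not PyPDF2 code. The function name `group_identical_objects` and the sample byte strings are hypothetical; only the object IDs are taken from the output above, and extracting the real object bytes from a PDF is left out.

```python
import hashlib
from collections import defaultdict


def group_identical_objects(objects):
    """Group object IDs whose raw bytes are identical.

    `objects` maps an object ID to that object's serialized bytes.
    Returns a list of ID groups, each with at least two members.
    """
    by_digest = defaultdict(list)
    for obj_id, data in objects.items():
        by_digest[hashlib.sha256(data).hexdigest()].append(obj_id)
    # Only groups with more than one member are duplicates.
    return [sorted(ids) for ids in by_digest.values() if len(ids) > 1]


# Hypothetical stand-in data: the IDs mirror the references on pages 0 and 1,
# the byte contents are made up for illustration.
sample = {
    19: b"font-f-0-0", 41: b"font-f-0-0",
    15: b"font-f-1-0", 37: b"font-f-1-0",
    27: b"image-x5", 49: b"image-x5",
    28: b"image-x7-page0", 50: b"image-x7-page1",
}

duplicates = sorted(group_identical_objects(sample))
print(duplicates)  # [[15, 37], [19, 41], [27, 49]]
```

Each group lists IDs that could, in principle, be collapsed into a single shared object referenced from every page.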

@jerome-nexedi

In a similar situation with many duplicated images, a workaround that worked for me was to post-process the PDF with Ghostscript using -dDetectDuplicateImages=true, as suggested at http://stackoverflow.com/questions/10450120/optimize-pdf-files-with-ghostscript-or-other#answer-10453202
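
A rough sketch of how that post-processing step might be scripted from Python. It assumes Ghostscript is installed and available as `gs` on the PATH; the helper names `build_gs_command` and `deduplicate_images` and the file names are hypothetical. Besides the flag mentioned above, it uses only the standard `pdfwrite` device and batch options.

```python
import subprocess


def build_gs_command(src, dst, gs_binary="gs"):
    """Build a Ghostscript command line that rewrites `src` into `dst`."""
    return [
        gs_binary,
        "-sDEVICE=pdfwrite",             # write a new PDF
        "-dDetectDuplicateImages=true",  # the flag from the workaround
        "-dNOPAUSE",                     # do not pause between pages
        "-dBATCH",                       # exit when processing is done
        f"-sOutputFile={dst}",
        src,
    ]


def deduplicate_images(src, dst):
    """Run Ghostscript; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_gs_command(src, dst), check=True)


# Hypothetical file names, shown only to illustrate the resulting command.
cmd = build_gs_command("merged.pdf", "merged-small.pdf")
print(" ".join(cmd))
```

Calling `deduplicate_images("merged.pdf", "merged-small.pdf")` would then run the command. Whether it shrinks a given file depends on how much of its size comes from repeated images.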

@MartinThoma added the is-bug label Apr 8, 2022
@MartinThoma added the nf-performance label Apr 22, 2022
@MartinThoma
Member

I think #207 might have changed the situation. Can you check again?

@MartinThoma added the needs-pdf label Jun 30, 2022
@MartinThoma
Member

Without having the PDF that has duplications, I cannot check myself.
