Pages are overwritten after merging PDFs with more than 50 pages #181

Closed
davidhirtz opened this issue Sep 6, 2023 · 6 comments

Comments

@davidhirtz

After merging more than 50 pages with AddPage and useImportedPage, the leading pages are overwritten/blank. I originally thought this was an issue with another package, libmergepdf, which depends on this project, but after removing that package and working directly with yours I encountered the same bug.

Below is a little test script to showcase the problem. The following loop generates two PDF files per iteration:

  1. A single-page PDF (in my example generated by Dompdf, but this happens with all PDFs)
  2. A combined PDF, which appends the single-page PDF to the previously merged stack.

use Dompdf\Dompdf;
use setasign\Fpdi\Fpdi;

$basePath = "/";

for ($i = 1; $i <= 100; $i++) {
    $dompdf = new Dompdf();
    $dompdf->setPaper('letter', 'landscape');
    $dompdf->loadHtml("<h1>Page $i</h1>");
    $dompdf->render();

    $output = $dompdf->output();

    $filename = "$basePath/page-$i.pdf";
    file_put_contents($filename, $output);

    if ($i > 1) {
        $pdf = new Fpdi();
        $filenames = ["$basePath/merged-" . ($i - 1) . ".pdf", $filename];

        foreach ($filenames as $file) {
            $pageCount = $pdf->setSourceFile($file);

            // Import every page of the source file onto a new page of matching size.
            for ($page = 1; $page <= $pageCount; $page++) {
                $pageId = $pdf->importPage($page);
                $templateSize = $pdf->getTemplateSize($pageId);
                $pdf->AddPage($templateSize['orientation'], $templateSize);
                $pdf->useImportedPage($pageId);
            }
        }

        $output = $pdf->Output('S');
    }

    $filename = "$basePath/merged-$i.pdf";
    file_put_contents($filename, $output);
}

In the end I have 100 x page-{$i}.pdf and 100 x merged-{$i}.pdf files. And until merged-50.pdf everything works, the PDF has 50 pages from 1-50. But in merged-51.pdf the first two pages are blank/white and the first content is visible on page 3 (correctly printing out "Page 3"). Every iteration removes the content from another page, so that the final file merged-100.pdf has 50 empty pages and the first page with the content is page 51.

Any idea what causes this? Neither the file size of the merged PDF nor the content, orientation or paper size seems to make a difference... I'm totally lost here.

Thank you for your help!

@JanSlabon
Member

The logic you use results in a very dirty PDF, which seems to trigger some limitations in PDF viewers. In Chrome or Foxit, for example, you will notice the issue at 42 pages; in pdf.js there is no issue at all.
The problem is that you re-import already imported pages, and you do this again for each page you "add". The internal structure will hold several duplicates of each page:
[Screenshot of the internal PDF structure]
...the tree goes much deeper.

So to fix this, just import all pages in a single run instead of adding them one after another:

$filenames = [];

for ($i = 1; $i <= 100; $i++) {
    $dompdf = new Dompdf();
    $dompdf->setPaper('letter', 'landscape');
    $dompdf->loadHtml("<h1>Page $i</h1>");
    $dompdf->render();

    $output = $dompdf->output();

    $filename = "$basePath/page-$i.pdf";
    file_put_contents($filename, $output);
    $filenames[] = $filename;
}

$pdf = new Fpdi();

foreach ($filenames as $file) {
    $pageCount = $pdf->setSourceFile($file);

    // Each original file is imported exactly once, so nothing is re-imported.
    for ($page = 1; $page <= $pageCount; $page++) {
        $pageId = $pdf->importPage($page);
        $templateSize = $pdf->getTemplateSize($pageId);
        $pdf->AddPage($templateSize['orientation'], $templateSize);
        $pdf->useImportedPage($pageId);
    }
}

$output = $pdf->Output('S');

$filename = "$basePath/merged.pdf";
file_put_contents($filename, $output);

You should understand that you cannot edit a PDF with FPDI; instead, you import the pages of an existing one and place them onto new pages. If you do this recursively, the internal structure grows accordingly, and it looks like some PDF viewers have a limit with regard to such structures. This may be an issue with the nesting level of graphics states, a limit which was documented in PDF 1.7 but removed in 2.0. Maybe it's simply a security check, as such recursion looks suspicious.

Generally, try to avoid importing PDFs that were themselves produced by an earlier import; start with the original documents instead.
If you really want to append a PDF to another at a low level, feel free to check out our SetaPDF-Merger component. It can be initiated with an existing document instance: https://manuals.setasign.com/setapdf-merger-manual/the-main-class/#index-2
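
To make the growth of the internal structure more tangible, here is a toy model in PHP. It is only a sketch under one assumption that is not confirmed in this thread: every import wraps the page in exactly one additional level of nesting. The function names `incrementalDepth` and `singleRunDepth` are hypothetical and not part of FPDI.

```php
<?php
// Toy model, NOT FPDI internals. Assumption: each import of a page adds
// exactly one level of nesting around whatever the page already contained.

/**
 * Nesting depth of page $page inside merged-$n.pdf when the files are merged
 * incrementally (merged-(n-1).pdf plus page-n.pdf), as in the original script.
 */
function incrementalDepth(int $page, int $n): int
{
    if ($page < 1 || $page > $n) {
        throw new InvalidArgumentException('Page must be between 1 and n.');
    }

    // merged-1.pdf is page-1.pdf itself (depth 0); every later merge
    // re-imports it once more.
    if ($page === 1) {
        return $n - 1;
    }

    // page-k.pdf is imported once into merged-k.pdf and then re-imported
    // by each of the (n - k) following merges.
    return $n - $page + 1;
}

/** Nesting depth of any page when all original files are imported in one run. */
function singleRunDepth(): int
{
    return 1;
}

// Print how deep the first and last page sit in a few of the merged files.
foreach ([50, 51, 100] as $n) {
    printf(
        "merged-%d.pdf: page 1 at depth %d, page %d at depth %d\n",
        $n,
        incrementalDepth(1, $n),
        $n,
        incrementalDepth($n, $n)
    );
}
```

Under this model the leading pages of the incrementally merged files sit deepest, which matches the observation that the leading pages are the first to go blank, while a single run keeps every page at the same shallow depth.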

@davidhirtz
Author

Hi @JanSlabon,

Thank you very much for this insightful reply. The code provided was just a test snippet; in the real application we can of course create the single PDFs before merging them. But I would still need to combine hundreds of PDFs into a single document at some point. If I use your example code, I quickly run into a "Failed to open stream: Too many open files" exception. Is there any way to close the files after adding them?

I will also take a look at the SetaPDF-Merger component; unfortunately, the budget for paid solutions on this project is pretty tight...

Thank you again!

@JanSlabon
Member

[...]"Failed to open stream: Too many open files"[...]

I haven't seen this error in a long time, especially at only 100 files. You may increase the limit at the OS level or split the merge at the maximum number of files (though if you do this several times, you end up in the same situation you initially reported). But again, the SetaPDF-Merger component has you covered: https://manuals.setasign.com/setapdf-merger-manual/performance-optimizations/#index-2
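
Before raising the limit at the OS level, it can help to see what the current limit is from within PHP. The following is a minimal sketch, assuming a Unix-like system with the POSIX extension available; the helper name `softOpenFileLimit` is hypothetical.

```php
<?php
/**
 * Returns the soft limit for open files of the current process,
 * or null if it is unlimited or cannot be determined.
 */
function softOpenFileLimit(): ?int
{
    // posix_getrlimit() is only available with the POSIX extension.
    if (!function_exists('posix_getrlimit')) {
        return null;
    }

    $limits = posix_getrlimit();
    if (!is_array($limits) || !isset($limits['soft openfiles'])) {
        return null;
    }

    $soft = $limits['soft openfiles'];

    // The value is either an integer or the string "unlimited".
    return is_numeric($soft) ? (int) $soft : null;
}

$limit = softOpenFileLimit();
echo $limit === null
    ? "Open-file limit is unlimited or unknown\n"
    : "Soft open-file limit: $limit\n";
```

If the number of source PDFs comes close to the reported value, the "Too many open files" error is to be expected when every source file stays open until the merged document is written.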

@davidhirtz
Author

Haha thanks, yeah the error did not happen around 100 files – more like 800 PDFs in. If there is no way to do this with an open source solution, I'll try to get the funds for the commercial licence. Thank you!

@JanSlabon
Member

You can try to hold the files in memory: https://manuals.setasign.com/fpdi-manual/v2/the-fpdi-class/#index-4
But internally FPDI makes use of streams, and I'm not sure whether these count as file handles, too.
If not, it will for sure increase the memory usage, which may trigger the next limitation (memory_limit).

For this, too, the SetaPDF-Merger has you covered, as you can work with an intermediate result to free memory: https://manuals.setasign.com/setapdf-merger-manual/performance-optimizations/#index-4
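
The in-memory approach could look roughly like the following sketch. It assumes FPDI 2 is installed via Composer with the autoloader already loaded, and it uses `StreamReader::createByString()` to pass the file contents to `setSourceFile()`. The function name `mergeFromMemory` is hypothetical, and whether this actually avoids the open-files limit was not verified in this thread.

```php
<?php
// Assumes Composer's autoloader has been loaded beforehand,
// e.g. via: require 'vendor/autoload.php';
use setasign\Fpdi\Fpdi;
use setasign\Fpdi\PdfParser\StreamReader;

/**
 * Merges the given PDF files by reading each one into a string first,
 * so no handle to the original file is kept open by this code.
 *
 * @param string[] $filenames Paths of the original (not previously merged) PDFs.
 * @return string The merged PDF as a binary string.
 */
function mergeFromMemory(array $filenames): string
{
    $pdf = new Fpdi();

    foreach ($filenames as $file) {
        $content = file_get_contents($file);
        if ($content === false) {
            throw new RuntimeException("Unable to read $file");
        }

        // Hand FPDI an in-memory stream reader instead of a file path.
        $pageCount = $pdf->setSourceFile(StreamReader::createByString($content));

        for ($page = 1; $page <= $pageCount; $page++) {
            $pageId = $pdf->importPage($page);
            $templateSize = $pdf->getTemplateSize($pageId);
            $pdf->AddPage($templateSize['orientation'], $templateSize);
            $pdf->useImportedPage($pageId);
        }
    }

    return $pdf->Output('S');
}
```

All file contents stay in memory until the output is produced, so with hundreds of files this trades the file-handle limit for a higher memory footprint.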

@davidhirtz
Author

Thank you again for your help. I was actually able to solve this with a simple shell_exec command and Ghostscript; this Gist was very helpful. The command only needs a few seconds to merge 800+ PDFs.
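
The exact command from the Gist is not reproduced in this thread. As a rough sketch of what such a call might look like, the following builds a typical Ghostscript `pdfwrite` invocation. The helper name `buildGhostscriptMergeCommand` and the paths in the usage comment are hypothetical, and the `gs` binary is assumed to be on the PATH.

```php
<?php
/**
 * Builds a Ghostscript command line that concatenates the input PDFs
 * into a single output file using the pdfwrite device.
 *
 * @param string[] $inputFiles PDFs to merge, in the desired page order.
 */
function buildGhostscriptMergeCommand(array $inputFiles, string $outputFile): string
{
    if ($inputFiles === []) {
        throw new InvalidArgumentException('At least one input file is required.');
    }

    $parts = [
        'gs',
        '-q',                 // suppress startup messages
        '-dBATCH',            // exit after processing all files
        '-dNOPAUSE',          // do not prompt between pages
        '-sDEVICE=pdfwrite',  // write a PDF as output
        '-sOutputFile=' . escapeshellarg($outputFile),
    ];

    // Escape every path so spaces and special characters are safe in the shell.
    foreach ($inputFiles as $file) {
        $parts[] = escapeshellarg($file);
    }

    return implode(' ', $parts);
}

// Usage sketch (hypothetical paths); sort naturally so page-2 comes before page-10:
// $files = glob('/path/to/page-*.pdf');
// natsort($files);
// shell_exec(buildGhostscriptMergeCommand(array_values($files), '/path/to/merged.pdf'));
```

Because Ghostscript rewrites the pages into a fresh document from the original files, this avoids the repeated re-importing described earlier in the thread.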

Closing this ...
