This repository has been archived by the owner on Jan 2, 2023. It is now read-only.

Auto page size based on content size #1627

Open
kachkaev opened this issue Apr 10, 2014 · 107 comments

@kachkaev

kachkaev commented Apr 10, 2014

It would be nice to be able to set something like --auto-size when calling wkhtmltopdf to generate a PDF with one page of the minimum possible size. This can currently be done only with a workaround, and a way to avoid that hack would be useful.

PDFs with sizes that depend on the content are widely used in LaTeX documents. One can render an HTML page and then embed the resulting PDF as a figure.

@ashkulz
Member

ashkulz commented Apr 11, 2014

Why would you prefer a PDF in such a scenario instead of an image generated by wkhtmltoimage?

@kachkaev
Author

PDF keeps text in vector format. I know wkhtmltoimage+SVG does the same, but SVGs can’t be included in LaTeX without conversion to PDF by other software. This can be done with Inkscape or similar, but it’s still an extra step.

@ashkulz
Member

ashkulz commented Apr 11, 2014

So you can include a PDF into LaTeX, and it keeps the text in vector format?

ashkulz added this to the future milestone Apr 11, 2014
@kachkaev
Author

Yep, that’s the main point of using wkhtmltopdf instead of wkhtmltoimage.

@ashkulz
Member

ashkulz commented Apr 11, 2014

I'm not too comfortable with adding this feature, as it only makes sense for single-page PDFs. Also, what happens if the text generates more than one page?

@kachkaev
Author

Well, that single page can be of an arbitrarily large size. I don't see a problem here.

Another way of implementing such feature could be adding PDF as export format in wkhtmltoimage:
wkhtmltoimage --width 0 --enable-smart-width --format pdf my_page.html my_page.pdf

The advantage of this approach is a more coherent API (images don't have page margins, page numbers, etc.).

How about this?

@kachkaev
Author

As mentioned earlier, to get a PDF of the minimum possible size, I'm doing the HTML conversion in two steps. First, I get an SVG using wkhtmltoimage and then convert it to PDF with Inkscape. The resulting files are then used in a LaTeX project, which is why they can’t be raster or have predefined dimensions.

All works fine except one thing: the links defined in the HTML get lost.

If I try to convert HTML to PDF in one step using wkhtmltopdf, the links remain, which is great (thanks to a new feature). However, it is impossible to tell wkhtmltopdf to keep the page size to a minimum, so I can’t switch to it from wkhtmltoimage, because automatically determined dimensions are my priority. Thus, each approach solves only part of the problem.

If wkhtmltoimage could export to PDF, or wkhtmltopdf supported something like --auto-size to produce custom-sized single-page documents, the problem I describe could be solved.

Is there any way this can be done now? There is a chance I've missed something in the docs.

@ashkulz
Member

ashkulz commented Jun 24, 2014

Not really, it will require some changes -- for which there is no time right now.

@kachkaev
Author

OK, no worries. I wish I could help with the development, but C++ and Python are quite unfamiliar to me.

It would be absolutely great if you considered implementing something like wkhtmltopdf --auto-size in the future.

@toncid

toncid commented Jul 8, 2014

+1

One use case for this feature would be printing the generated PDF on a paper-roll printer, which prints everything on one "page".

@kachkaev
Author

kachkaev commented Jul 8, 2014

@ashkulz, it’s great that you’ve shortlisted this feature! Roughly when are you planning to work on it?

I’m currently finding more and more difficulties in using wkhtmltoimage + inkscape to get custom-sized single-page pdfs. Some borders get lost (#1812), shadows break (#1835), extra lines are added to pdfs by inkscape, etc. These problems don’t appear when using wkhtmltopdf, but page size cannot be set to necessary minimum this way.

The only workaround I see so far is running wkhtmltoimage x.html x.png, getting the size of that PNG and then executing wkhtmltopdf with the derived --page-height and --page-width. However, this solution does not work well either: the resulting PDFs sometimes spill onto a second page, and the number of pages cannot be limited.

I could look more into this temporary hack and then share a shell script here, but what if you are willing to patch wkhtmltopdf in the next few days or weeks?

Not rushing you, just asking :–) Thanks for this awesome library again!
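
A minimal sketch of the two-step workaround described above (measure the PNG produced by wkhtmltoimage, then pass a matching page size to wkhtmltopdf), in Python. It assumes 96 pixels per inch, which the thread does not confirm, and zero margins; `png_size` and `page_flags` are hypothetical helper names. As noted above, the output can still spill onto a second page, so the computed height is only a starting point.

```python
import struct

PX_PER_INCH = 96.0  # assumption: CSS reference density, not confirmed in the thread
MM_PER_INCH = 25.4


def png_size(data):
    """Return (width, height) in pixels from the first 24 bytes of a PNG file."""
    if data[:8] != b"\x89PNG\r\n\x1a\n" or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # IHDR stores width and height as big-endian 32-bit integers at offsets 16 and 20.
    return struct.unpack(">II", data[16:24])


def page_flags(width_px, height_px):
    """Build wkhtmltopdf page-size and zero-margin flags for a pixel size."""
    width_mm = width_px * MM_PER_INCH / PX_PER_INCH
    height_mm = height_px * MM_PER_INCH / PX_PER_INCH
    return [
        "--page-width", f"{width_mm:.2f}mm",
        "--page-height", f"{height_mm:.2f}mm",
        "-T", "0", "-B", "0", "-L", "0", "-R", "0",
    ]


# Possible usage (requires wkhtmltoimage and wkhtmltopdf on the PATH):
#   subprocess.run(["wkhtmltoimage", "x.html", "x.png"], check=True)
#   with open("x.png", "rb") as f:
#       width, height = png_size(f.read(24))
#   subprocess.run(["wkhtmltopdf", *page_flags(width, height), "x.html", "x.pdf"], check=True)
```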

@lukeenglish

I would very much be interested in this feature :) We print from rollers, so this feature would be perfect.

I could probably help out with the development too, once I have my head round the code.

@ashkulz
Member

ashkulz commented Aug 22, 2014

@lukeenglish: patches are always welcome 👍

@lukeenglish

How do you currently determine when to break onto a new page if no manual breaks are provided?

I assume it's when the content extends beyond the page height.

I am more than happy to try and patch this.

@lukeenglish

Just saw this in your documentation:

The current page breaking algorithm of WebKit leaves much to be desired.
Basically webkit will render everything into one long page, and then cut it up
into pages. This means that if you have two columns of text where one is
vertically shifted by half a line. Then webkit will cut a line into to pieces
display the top half on one page. And the bottom half on another page. It will
also break image in two and so on. If you are using the patched version of QT
you can use the CSS page-break-inside property to remedy this somewhat. There is
no easy solution to this problem, until this is solved try organizing your HTML
documents such that it contains many lines on which pages can be cut cleanly.

Surely we just need to interrupt the splitting method in the patched WebKit?

@leonelsr

I'd start by looking into wkhtmltoimage, as it already interrupts this splitting (or does it join everything?). Anyway, it doesn't seem difficult to implement!

@lukeenglish

Yeah cheers for the heads up, that's where I am looking at the moment.

@slackday

slackday commented Sep 2, 2014

+1 I'd also be interested in achieving this somehow. I'd like to keep the vectors from the PDF but have the content fit on one page, no matter how big. I tried setting a big paper size, but that creates unnecessary file size and white space when opening the file in, for example, Illustrator.

@toncid

toncid commented Oct 13, 2014

Hey guys, any updates on this issue?

@forsbergplustwo

+1 on this. Receipt printers are the main use case in our work. Unfortunately I don't have the chops to create a patch myself.

@toncid

toncid commented Jan 14, 2015

Hello, sharing any feedback would be awesome. :)

@ashkulz
Member

ashkulz commented Jan 14, 2015

What updates would you want? I'm not working on it, as I said above ...

@toncid

toncid commented Jan 14, 2015

Well, @lukeenglish and @leonelsr had some ideas to begin with, so I thought that they might share their findings and estimates on whether this feature request is possible/feasible to implement.

@rickysullivan

+1 For this.

I'm thinking I might have to run wkhtmltoimage, read the size, then run wkhtmltopdf. I need to keep links.

@Oliboy50

Oliboy50 commented Aug 4, 2015

+1

@kenorb

kenorb commented Aug 18, 2015

@PhilterPaper

That link just eventually loops back to this particular issue!

By the way, before trying to produce outsized single PDF pages, someone ought to find out if there really is a semi-/un-documented limit of 200 inches (508cm) for a PDF page dimension. At least you would know there is an upper limit on what you can do.

@slackday

This thread has been inactive for a long time since my last post. Too bad since this would be a great feature that is requested a lot.

That's a nice catch, @PhilterPaper; I didn't know that.

I worked around this problem by setting all page margins to zero with the margin_top, margin_bottom, margin_left, margin_right parameters, and then setting page_height and page_width to the HTML document size by converting pixels to cm.

It's not 100% perfect but works well enough in my case so thought someone else would be interested in this while we keep waiting for a potential fix to this issue. :)

According to http://www.translatorscafe.com/cafe/units-converter/typography/calculator/pixel-(X)-to-centimeter-[cm]/ 1px ~= 0.02645833333333cm

However, I found that when setting the height, it's actually a little bit bigger in the resulting PDF. I got the best results when calculating 1px of width = 0.0333333cm and 1px of height = 0.04cm.

So take the width in px * 0.0333333
and the height in px * 0.04

And needless to say. None of my documents have exceeded 508cm. Yet.

I will continue to follow this thread with interest :)
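
A rough sketch of the conversion described in this comment, in Python. The two empirical factors are the ones reported above; the option names mirror the margin_top/page_height parameters mentioned in the comment, but which wrapper library they belong to is not stated, so the returned dictionary is only an assumed shape. `page_size_cm` and `wrapper_options` are hypothetical names.

```python
NOMINAL_CM_PER_PX = 2.54 / 96      # about 0.0264583 cm, the converter value quoted above
EMPIRICAL_CM_PER_PX_W = 0.0333333  # width factor reported in the comment
EMPIRICAL_CM_PER_PX_H = 0.04       # height factor reported in the comment


def page_size_cm(width_px, height_px,
                 fx=EMPIRICAL_CM_PER_PX_W, fy=EMPIRICAL_CM_PER_PX_H):
    """Convert an HTML document size in pixels to a page size in cm."""
    return round(width_px * fx, 2), round(height_px * fy, 2)


def wrapper_options(width_px, height_px):
    """Zero margins plus a page size derived from the document size.

    The keys follow the parameter names used in the comment; the exact
    format a given wrapper expects is an assumption.
    """
    width_cm, height_cm = page_size_cm(width_px, height_px)
    return {
        "margin_top": "0",
        "margin_bottom": "0",
        "margin_left": "0",
        "margin_right": "0",
        "page_width": f"{width_cm}cm",
        "page_height": f"{height_cm}cm",
    }
```

For a 600px by 1000px document this gives a 20.0cm by 40.0cm page, compared with roughly 15.9cm by 26.5cm using the nominal factor.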

@antivanity

+1

@petervrs

+1

@Sammael1106

come on guys! very useful feature!

@vivianamarquez

+1

@vivianamarquez

A workaround
wkhtmltopdf -T 0 -B 0 --page-width 210mm --page-height 10000mm input.html output.pdf

@giansalex

Yes, but I need it based on content size 😕, without the --page-height option

@vivianamarquez

Yes, but I need it based on content size 😕, without the --page-height option

Well, my fix to that was to multiply 297 (standard height of A4 documents) by the number of pages in the document. That should give you the height based on your content.

@Sammael1106

My previous solution was to find the page height in px via JS, convert it to millimeters (X = px*0.264583333) and then render the PDF file with {page-height: X}.

But it was a bad idea to measure the height on the client side, because fonts are drawn differently in different browsers.
Now I'm going to do the same with PhantomJS on the server side, or wait for a wkhtmltopdf option like page-height: 'auto' 😄

@muyutingfeng

+1

@nitin-vavdiya

+1

@theraccoonbear

theraccoonbear commented Aug 10, 2020

Seems silly that this is still an issue. I'm calling wkhtmltopdf from a node script via a shell command. My, admittedly clunky, solution is to pull and render the PDF "normally". Then use pdfinfo to get the page count. Then, pull/render again but specifying a page height (per @vivianamarquez) of PageCount * 297. It uses twice the bandwidth (yes I could save it off locally first) and twice the rendering effort, but it works 🤷‍♂️
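
The render-twice approach described above could be sketched in Python roughly like this. It assumes that pdfinfo prints a `Pages:` line and that the first render uses A4 pages (297mm high); `page_count` and `single_page_height_mm` are hypothetical helper names, and the file names in the usage comment are placeholders.

```python
import re

A4_HEIGHT_MM = 297  # height of one A4 page, as used in the comments above


def page_count(pdfinfo_output):
    """Extract the page count from the text printed by pdfinfo."""
    match = re.search(r"^Pages:\s+(\d+)", pdfinfo_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line found in pdfinfo output")
    return int(match.group(1))


def single_page_height_mm(pages, page_height_mm=A4_HEIGHT_MM):
    """Page height that should hold all pages of the first render on one page."""
    return pages * page_height_mm


# Possible usage (requires wkhtmltopdf and pdfinfo on the PATH):
#   subprocess.run(["wkhtmltopdf", "in.html", "first.pdf"], check=True)
#   info = subprocess.run(["pdfinfo", "first.pdf"], capture_output=True,
#                         text=True, check=True).stdout
#   height = single_page_height_mm(page_count(info))
#   subprocess.run(["wkhtmltopdf", "--page-height", f"{height}mm",
#                   "--page-width", "210mm", "in.html", "final.pdf"], check=True)
```

The result is always a whole multiple of 297mm, so the last part of the page may be blank.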

@lorenzos

lorenzos commented Aug 10, 2020

I agree that an option would be great, but here's my take to avoid rendering the document twice and relying on the calculations shown here. I print my document setting a very long page (like, 5000mm for example), making sure to get a single page, and then I use pdfcrop to crop content. It works directly on the output PDF file, so there's no need to render it twice and it's very fast. You can set crop margins if you want to leave some blank space around borders.

If you need to have an exact page width, things get more complicated, but it's possible to use pdfcrop --verbose to get the crop area size and then calculate margins accordingly.

ashkulz modified the milestones: future, 0.12.7 Aug 10, 2020
@PhilterPaper

One aspect of unlimited page length is that, while pages are built from the top down, the coordinate system runs from the bottom up. Thus, the y-coordinate of the top of the page needs to be known when you start. It's too bad that Adobe chose to do this (almost every other graphics system has y=0 at the top). That's no problem for a fixed media size, but trying to build a page of unknown media length means that sooner or later you will go negative on your coordinates. You would need to make a second pass to update all y-coordinates to have y=0 at the bottom of the long page (as well as reset the MediaBox). The alternatives include making a regular first pass (normal media size) and determining the total length at the end, then making a second pass with the extended media size; or making a wkHTMLtoImage pass first to determine media size; or somehow gluing together regular pages into one; or simply oversizing the media and cropping the result afterwards. You might also be able to use a flipped coordinate system (y=0 at the top) and either make a second pass through to update/correct it, or use the cm operator to flip the coordinate system.
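
To illustrate the last option, here is a small Python sketch of the flip matrix that the cm operator would take. It only demonstrates the arithmetic (y becomes H - y for a page of height H); `flip_matrix`, `apply_matrix` and `cm_operator` are hypothetical names, and whether wkhtmltopdf could emit such an operator is not addressed here. A flipped coordinate system also mirrors text and images unless they are flipped back, so this is not a complete solution.

```python
def flip_matrix(page_height_pt):
    """Operands (a, b, c, d, e, f) of a PDF cm operator that maps y to H - y."""
    return (1, 0, 0, -1, 0, page_height_pt)


def apply_matrix(matrix, x, y):
    """Transform a point the way PDF does: x' = a*x + c*y + e, y' = b*x + d*y + f."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


def cm_operator(matrix):
    """Format the matrix as it would appear in a PDF content stream."""
    return " ".join(f"{value:g}" for value in matrix) + " cm"


# With an A4 page height of 842pt, the top-left corner becomes the origin:
#   apply_matrix(flip_matrix(842), 0, 842)  gives (0, 0)
#   cm_operator(flip_matrix(842))           gives "1 0 0 -1 0 842 cm"
```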

Don't forget that PDF officially limits you to 14400 points, unless you use a UserUnit scale factor. I think a lot of readers are lax in enforcing that rule, but you never know when you'll run into one that is strict about it.
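
As a worked example of that limit, the following Python sketch converts between millimeters and points (72 points per inch, 25.4mm per inch) and computes the smallest scale factor a given page length would need. The function names are hypothetical, and whether wkhtmltopdf can write a UserUnit entry is not something this thread establishes.

```python
PT_PER_INCH = 72
MM_PER_INCH = 25.4
MAX_UNITS = 14400  # the page dimension limit mentioned above


def mm_to_pt(length_mm):
    """Convert millimeters to PDF points."""
    return length_mm * PT_PER_INCH / MM_PER_INCH


def max_length_mm(user_unit=1.0):
    """Longest page dimension in mm for a given UserUnit scale factor."""
    return MAX_UNITS * user_unit * MM_PER_INCH / PT_PER_INCH


def min_user_unit(length_mm):
    """Smallest scale factor (at least 1.0) that fits length_mm within the limit."""
    return max(1.0, mm_to_pt(length_mm) / MAX_UNITS)
```

With the default scale factor the limit works out to 5080mm (200 inches), so a 5000mm page fits, while a 10000mm page would need a scale factor of about 1.97.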

@awendland

awendland commented Nov 11, 2020

Not sure if this is useful, but there was a recent WebKit development related to this issue: according to this Twitter thread WebKit now contains a createPDF method that generates PDFs of web pages that are one continuous page. Unfortunately the official Apple Documentation about createPDF is sparse, but according to Ștefan Godoroja's message that is how the method functions.

@PhilterPaper

By the way, I see a lot of comments stating that the "standard" page size is A4 (595Pt x 842Pt). Keep in mind that wkHTMLtoPDF apparently sets this (via MediaBox), and that all by itself (no MediaBox), the standard PDF page size is US Letter (8.5in x 11in, 612Pt x 792Pt). If you want US Letter page size with wkHTMLtoPDF, you need to set the page size to that explicitly.

@danielpinon

+1

@beevabeeva

My frustration made me throw together this dirty script.

Following @lorenzos's solution, along with this script used for cropping:

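#!/bin/bash
# Usage: ./script.sh input_file.html
# 1. Convert the HTML to one long single-page PDF (5000mm long):
wkhtmltopdf --enable-local-file-access --page-width 210mm --page-height 5000mm "$1" converted.pdf

# 2. Crop the long PDF:
fname="converted.pdf"
# Page width and height in points, as reported by pdfinfo
pagesize=( $(pdfinfo "$fname" | grep "Page size" | cut -d ":" -f2 | \
    awk '{ print $1,$3 }') )
# Bounding box of the content (left, bottom, right, top), as reported by pdfcrop
bounding=( $(pdfcrop --verbose "$fname" | grep "%%HiResBoundingBox" | \
    cut -d":" -f2 ) )
# pdfcrop --verbose also writes converted-crop.pdf as a side effect; discard it
rm -f "${fname//.pdf/-crop.pdf}"
# Restore the original left and right whitespace so the page width is preserved
lmarg="${bounding[0]}"
rmarg="$(python3 -c "print('{:.3f}'.format(${pagesize[0]} - ${bounding[2]}))")"
# Margins are given as "left top right bottom"
pdfcrop --margins "$lmarg 75 $rmarg 75" "$fname" "single_page.pdf"

# 3. Clean up the intermediate PDF (converted.pdf, the one with all the whitespace ;))
rm "$fname"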

The script takes in a single argument, input_file.html.

Assumptions:

  • I keep top and bottom margins of 75, just so it doesn't look weird. You can obviously change this.
  • I made this for a locally saved html site (hence --enable-local-file-access).
  • This script is run in a folder containing the index.html file and folder containing all related media. e.g:

[image: screenshot of the folder contents]
(Note: in the screenshot above, the script is saved in a file called script.sh and has been made executable.)

Open the terminal here and:
$ ./script.sh my_html_site.html

You should be left with a file called single_page.pdf

Quick and dirty, but gets it done.

Hope this feature gets added. It may be quirky but it's also insanely frustrating :P .

@asolopovas

+1

@Next-Door-Tech

+1

@matj1

matj1 commented Nov 26, 2021

One more reason that wkhtmltoimage isn't suitable for converting to PDF is that it doesn't preserve text. The text in the original HTML is converted to paths when the HTML file is converted to SVG, so it can't be selected and copied from the result.

@mirfatif

As in @beevabeeva's solution, I found that pdfcrop does not preserve bookmarks. Here is a marvelous solution using gs, particularly for big PDF files.

@PackElend

any chance that it will be supported anytime soon?

@Sicos1977

Probably never :-) ... you can try my solution if you want --> https://github.com/Sicos1977/ChromeHtmlToPdf

                case PaperFormat.FitPageToContent:
                    PreferCSSPageSize = true;
                    break;
