Issue with PDF output size #20

Closed
Cellomaster87 opened this issue Mar 26, 2020 · 20 comments · Fixed by #21
Comments

@Cellomaster87

Cellomaster87 commented Mar 26, 2020

I'm using your tool inside of a shell script which gets run by an Automator.
I am not the author of this script, but it worked until 2-3 weeks ago. Now, whatever I try, the output PDFs do not come out at the size written inside the script.

For example, the one pasted below comes out 611 x 684 instead of the size set in the script.
Could you help me find the issue? Thanks!

#!/bin/bash

export PATH=$PATH:/usr/local/bin:$HOME/Desktop
EXT="_Quarto.pdf"
OUTFOLDER="$1"
shift 1

function pdf_rescale () {
    /usr/local/bin/pdfscale.sh -r 'custom pt 657.64 864.57' -s 0.985 "${1}" "${2}"
    return
}

# loop through all input files, or folders and transform A4 PDF to Quarto (232 x 305) PDF scaled to 98.5 percent
for f in "$@"
do
    # is it a directory?
    if [ -d "$f" ]; then
        # nullglob prevents expansion of "*.pdf" to variable x if no match
        shopt -s nullglob
        for x in "${f}"/*.pdf
        do
            shopt -s extglob
            pdf_rescale "${x}" "${OUTFOLDER}/${x//+(*\/|.*)}${EXT}"
            shopt -u extglob
        done
        shopt -u nullglob
    # is it a non-zero, regular file?
    elif [ -s "$f" ]; then
        shopt -s extglob
        pdf_rescale "${f}" "${OUTFOLDER}/${f//+(*\/|.*)}${EXT}"
        shopt -u extglob
    fi
done
exit 0

EDIT: with help from user VikingOSX from discussions.apple.com, I discovered that the issue only appears when the input file is a PDF made by combining different PDFs, for example using Adobe Acrobat.
Your tool correctly resizes the MediaBox of the file but also applies a CropBox to it that maintains the proportions of the original file.
For example: if I want to convert such a file from A4 to 9x12in, the resulting file available to the user will be 12in high but 8.51in wide, as it keeps the A4 proportions.
Could you please take a look at this issue and tell me how to solve it?
Thank you very much

@tavinus
Owner

tavinus commented Mar 30, 2020

Hi, I am not sure I understand everything, but I will try to explain what I can.

First, some questions:

  • What version of pdfScale are you using?
  • Can you post the verbose output of your run here (as code pls)?
  • Can you provide me with a PDF file example so I can try doing the same here?
  • Can you provide me with the resulting PDF that didn't go as expected?

You are running in mixed mode, resizing the paper and then scaling down.

1st Step - Resize

  • Always runs before scaling
  • Fits to page
    • Will resize paper and reposition the content
    • If the new paper has the same proportions, things will look the same
    • If the new paper has different proportions, the fitting of the contents will change
    • You can also disable the fit-to-page setting with --no-fit-to-page
  • By default it will try to detect if you are flipping portrait/landscape and correct it; you may disable this with -f disable or -f d
  • By default it will run ghostscript auto-rotation detection in auto mode, you can also disable this with -a none or -a n
  • And then you can even manually position things with vertical/horizontal alignment setting and x/y offset settings, here is the help description for it
 --hor-align, --horizontal-alignment <left|center|right>
             Where to translate the scaled page
             Default: center
             Options: left, right, center
 --vert-align, --vertical-alignment <top|center|bottom>
             Where to translate the scaled page
             Default: center
             Options: top, bottom, center
 --xoffset, --xtrans-offset <FloatNumber>
             Add/Subtract from the X translation (move left-right)
             Default: 0.0 (zero)
             Options: Positive or negative floating point number
 --yoffset, --ytrans-offset <FloatNumber>
             Add/Subtract from the Y translation (move top-bottom)
             Default: 0.0 (zero)
             Options: Positive or negative floating point number

2nd Step - Scale

  • Never affects page size, whatever came from step 1, will be kept
  • Will zoom contents inside the page, which may bleed "outside" the page

On MacOSX, there may be a problem if you are trying to process a PDF file that was just created by another script step and may not yet have Spotlight's metadata. This affects detection through mdls, because mdls reads that metadata, and it only happens if the PDF was created milliseconds before.

You can force another method of size detection to make sure this is not a problem though. Try installing (from homebrew) imagemagick or xpdf and using -m i or -m p to force using one of them.

Next version will have ghostscript detection, so this should not be a problem anymore.
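
The fallback described above could be scripted roughly as follows. This is only a sketch: the helper name pick_size_mode is made up for illustration, and it assumes that ImageMagick is installed as an `identify` binary and xpdf as `pdfinfo`, matching the -m i and -m p modes mentioned above.

```shell
# pick_size_mode: hypothetical helper, not part of pdfScale.
# Prints "i" if ImageMagick's identify is on PATH, "p" if pdfinfo is,
# and nothing if neither is available (pdfScale then uses its default).
pick_size_mode() {
    if command -v identify >/dev/null 2>&1; then
        printf 'i\n'
    elif command -v pdfinfo >/dev/null 2>&1; then
        printf 'p\n'
    fi
}

# Usage sketch (file names are placeholders):
#   mode="$(pick_size_mode)"
#   pdfscale.sh ${mode:+-m "$mode"} -r a4 in.pdf out.pdf
```

The `${mode:+...}` expansion adds the -m option only when a tool was found, so the call still works on a machine with neither installed.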


Let me know if this helps.
Cheers! 🍺
Gus

@tavinus
Owner

tavinus commented Mar 30, 2020

BTW, postscript points are integers.
They should be rounded to integers even if you use floating point numbers (from what I recall).
I remember that Ghostscript would not like to receive Points as floats.

Can you try it using integers for custom paper size? Or use metric.
Can you also try using a pre-defined paper size like letter or A4?

EDIT3: Does the merged (input) PDF maybe have pages with different paper sizes?
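
For reference, the relation between the metric size and the point values in the script can be checked with a few lines of shell. This is a sketch using the standard conversion of 72 points per inch and 25.4 mm per inch; mm_to_pt is a made-up helper name, and whether pdfScale rounds in exactly this way is an assumption.

```shell
# mm_to_pt: hypothetical helper. Converts millimetres to PostScript points
# (1 in = 72 pt = 25.4 mm) and prints the exact value (2 decimals)
# followed by the value rounded to a whole point.
mm_to_pt() {
    awk -v mm="$1" 'BEGIN { pt = mm * 72 / 25.4; printf "%.2f %.0f\n", pt, pt }'
}

mm_to_pt 232   # prints: 657.64 658
mm_to_pt 305   # prints: 864.57 865
```

So 'custom pt 657.64 864.57' and 'custom mm 232 305' describe the same paper, and both land on 658 x 865 once rounded to integers.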

@Cellomaster87
Author

Cellomaster87 commented Mar 30, 2020 via email

@tavinus
Owner

tavinus commented Mar 30, 2020

Hi, there are no links in your post.

Seems like the upgrade is broken on MacOSX because it uses BSD's readlink instead of GNU's.
I have already found a bash implementation to replace it, which will come in the next version. This is related to #17

To force imagemagick, just add -m i to your call.

You can paste images here, so maybe the screenshots would help me understand the results.

Would be nice to have the actual PDFs as well though.

@Cellomaster87
Author

Cellomaster87 commented Mar 30, 2020 via email

@tavinus
Owner

tavinus commented Mar 30, 2020

Yep, now I got it.
Downloaded and checking.

@tavinus
Owner

tavinus commented Mar 30, 2020

Still investigating, but this is what I have so far:

  • File Two.pdf does have some weird stuff coming with the /Mediabox definition (from grep).
  • Even though this causes an error, the page size seems to be processed accordingly.
  • Using imagemagick (or pdfinfo) will solve this problem (add -m i to call).

The verbose run of Two.pdf

$ pdfscale -v -r 'custom mm 232 305' -s 0.985 Two.pdf
pdfscale v2.4.9 - Verbose Execution
   Mixed Tasks: Resize & Scale
       Dry-Run: FALSE
    Input File: Two.pdf
   Output File: Two.CUSTOM.SCALED.pdf
 Get Page Size: Adaptive Enabled
        Method: Grep
/usr/local/bin/pdfscale: line 1497: warning: command substitution: ignored null byte in input
  Source Width: 595 postscript-points
 Source Height: 842 postscript-points
    Print Mode: Print ( auto/empty )
   Fit To Page: Enabled (default)
   Auto Rotate: PageByPage
   Flip Detect: No change needed
  Run Resizing: CUSTOM ( 658 x 865 ) pts
     New Width: 658 postscript-points
    New Height: 865 postscript-points
  Scale Factor: 0.985
    Vert-Align: CENTER
     Hor-Align: CENTER
 Translation X: 5.01 = 5.01 + 0.00 (offset)
 Translation Y: 6.59 = 6.59 + 0.00 (offset)
   Run Scaling: -1 %
    Background: No background (default)
  Final Status: File created successfully

The error is here:

/usr/local/bin/pdfscale: line 1497: warning: command substitution: ignored null byte in input

But as mentioned the page size is parsed correctly and the execution seems to proceed without problems.

This is what the grep call returns on One.pdf and Two.pdf

One.pdf

$ grep -a -e '/MediaBox' -m 1 ./One.pdf
/MediaBox [0 0 595.000000 842.000000]

Two.pdf

$ grep -a -e '/MediaBox' -m 1 ./Two.pdf
ðV ù(Õ��çKp       �a§��uV4L��ò×ç]áÐ�Àxú©AÖ0�àt~îSD?�NT�Äg¢jO�§|®I�O|C|%´�áÑu?k�Óºá�º�òÛ JÀz�È_H/üÛ
<</Contents[1422 0 R 1423 0 R 1424 0 R 1425 0 R 1426 0 R 1427 0 R 1428 0 R 1430 0 R]/CropBox[0 0 595.2756 841.8898]/MediaBox[0 0 595.2756 841.8898]/Parent 1400 0 R/Resources 1437 0 R/Rotate 0/T<</Filter/FlateDecode/First 72/Length 642/N 8/Type/ObjStm>>stream

Those weird chars are what causes the parsing problems.
Maybe I can run it through a pipe with strings or cat to mitigate the problem (eg.)

$ strings Two.pdf | grep -e '/MediaBox' -m 1
<</Contents[1422 0 R 1423 0 R 1424 0 R 1425 0 R 1426 0 R 1427 0 R 1428 0 R 1430 0 R]/CropBox[0 0 595.2756 841.8898]/MediaBox[0 0 595.2756 841.8898]/Parent 1400 0 R/Resources 1437 0 R/Rotate 0/Type/Page>>

But as mentioned, you can use -m i to solve this as well.

However, this does not seem to be the problem, since the page size is parsed correctly (even with the error).
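
The strings-based lookup can be taken one step further to pull out the numbers themselves. The following is a rough sketch, not pdfScale's actual code: first_mediabox is a made-up name, and it only looks at the first line that mentions /MediaBox, as the grep call above does.

```shell
# first_mediabox: hypothetical helper, not pdfScale's implementation.
# $1 = PDF file. Prints "width height" of the first /MediaBox found,
# rounded to whole points. Prints nothing if no box is found.
first_mediabox() {
    strings "$1" | grep -m 1 -e '/MediaBox' |
        # keep only the four numbers between the brackets
        sed -n 's/.*\/MediaBox *\[\([^]]*\)\].*/\1/p' |
        # box is [x0 y0 x1 y1], so width = x1 - x0 and height = y1 - y0
        awk 'NF == 4 { printf "%.0f %.0f\n", $3 - $1, $4 - $2 }'
}
```

Run against a page dictionary like the one from Two.pdf, this gives 595 842, the same values the verbose run reports as Source Width and Source Height. It will not see boxes stored inside compressed object streams.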

Please note how complex the second PDF definition is and how it has a lot more stuff than the other file has. I would guess that these other things are interfering with the result.


Honestly, I am still not 100% sure I understand what the problem is?
It is a bit confusing, but it seems like the proportions of the original file are maintained, right?

The resulting MediaBox size seems to be correct, so pdfScale seems to be working properly, but the CropBox seems to be keeping the original proportion and that is what ends up rendering on screen.

I am not sure why you have a cropbox defined. From what I understand, that is used in pre-press to define a page with a bleed. So they can print it a bit bigger than the actual needed size and then cut the excess later (for a better finishing and no borders).

So maybe you can config the Acrobat merger in order for it to not define a cropbox?
I would try to tinker with the merger options to see if it makes any difference.

Here are some explanations on the PDF boxes:

Anyways, let me know if this helps.
I recommend using Lightshot to create screenshots (copy to memory) and then you can just paste them here (ctrl + V). You can save the image and drag+drop here as well.


While writing this I made a few more tests and got some new info:

Example run

$ pdfscale -m i -v -r 'custom mm 232 305' -s 0.985 Two.pdf
Checking for imagemagick's identify
pdfscale v2.4.9 - Verbose Execution
   Mixed Tasks: Resize & Scale
       Dry-Run: FALSE
    Input File: Two.pdf
   Output File: Two.CUSTOM.SCALED.pdf
 Get Page Size: Adaptive Disabled
        Method: ImageMagick's Identify
  Source Width: 595 postscript-points
 Source Height: 842 postscript-points
    Print Mode: Print ( auto/empty )
   Fit To Page: Enabled (default)
   Auto Rotate: PageByPage
   Flip Detect: No change needed
  Run Resizing: CUSTOM ( 658 x 865 ) pts
     New Width: 658 postscript-points
    New Height: 865 postscript-points
  Scale Factor: 0.985
    Vert-Align: CENTER
     Hor-Align: CENTER
 Translation X: 5.01 = 5.01 + 0.00 (offset)
 Translation Y: 6.59 = 6.59 + 0.00 (offset)
   Run Scaling: -1 %
    Background: No background (default)
  Final Status: File created successfully

Notes

  • Source Width/Height is always correct (even when using grep with the error)
  • Target Width/Height is also correct
    • PTS ( 658 x 865 ) == MM ( 232 x 305 )
  • The resulting /Mediaboxes have the correct size
$ strings Two.CUSTOM.SCALED.pdf | grep -e '/MediaBox'
<</Type/Page/MediaBox [0 0 658 865]
<</Type/Page/MediaBox [0 0 658 865]
<</Type/Page/MediaBox [0 0 658 865]
. . . 
  • The resulting /Cropboxes are the ones keeping the proportion of the page
    • Their sizes are ( 634.808105 x 865.0 )
$ strings Two.CUSTOM.SCALED.pdf | grep -e '/CropBox'
/CropBox [23.191925 0 634.808105 865.0]
/CropBox [23.191803 .00003051758 634.808228 865.0]
/CropBox [23.191925 0 634.808105 865.0]
/CropBox [23.191925 0 634.808105 865.0]
/CropBox [23.191925 0 634.808105 865.0]
/CropBox [23.191803 .00003051758 634.808228 865.0]
/CropBox [23.3735352 .00003051758 634.626465 865.0]
. . .
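
The numbers in the notes above can be checked with a small calculation. This is a sketch only; box_dims is a made-up helper that takes a box name and one line of text containing that box, and prints width, height and the width/height ratio.

```shell
# box_dims: hypothetical helper for checking box proportions.
# $1 = box name (e.g. CropBox), $2 = text containing "/<name> [x0 y0 x1 y1]".
# Prints "width height ratio".
box_dims() {
    printf '%s\n' "$2" |
        sed -n "s/.*\/$1 *\[\([^]]*\)\].*/\1/p" |
        awk 'NF == 4 { w = $3 - $1; h = $4 - $2; printf "%.2f %.2f %.4f\n", w, h, w / h }'
}

box_dims CropBox  '/CropBox [23.191925 0 634.808105 865.0]'   # 611.62 865.00 0.7071
box_dims MediaBox '/MediaBox[0 0 595.2756 841.8898]'          # 595.28 841.89 0.7071
box_dims MediaBox '<</Type/Page/MediaBox [0 0 658 865]'       # 658.00 865.00 0.7607
```

The output CropBox has the same 0.7071 ratio as the A4 input, while the new MediaBox is at 0.7607, which is why the visible page is about 612 points wide instead of 658.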

So we at least know where the problem is now, but I still don't know what path I should take to solve this.

This post seems to shed some light on the /Cropbox issue and offers a workaround.
https://stackoverflow.com/a/26989410/1273636

Your file also has a /Cropbox defined for EACH page as in the question above.

I will keep researching it.

Cheers!
Gus

@Cellomaster87
Author

Cellomaster87 commented Mar 30, 2020 via email

@tavinus
Owner

tavinus commented Mar 30, 2020

Seems like I have a solution to bypass the Cropboxes with the new sizes.

I am still researching the best way to implement it though.
I don't think I want to always apply the Cropboxes.
I would prefer that files like One.pdf, which do not have any /CropBoxes defined, stay as they are, without a cropbox being added to each page. I am not sure I can detect this automagically in all cases and then apply the change.

I am inclined to just add a cli parameter that will redefine all Cropboxes to the same size as the paper (Mediabox). This will be easy to implement and run, but will not be a universal/automatic solution (which would be nice).

Gus

@tavinus
Owner

tavinus commented Mar 30, 2020

The problem is thinking of all possible outcomes and situations.

As mentioned before, this only applies to resizing, for scaling this is all irrelevant.

Possibilities

  • Resize Cropbox proportionally to the original
    • Seems to be the current behavior, even though unintended
  • Resize Cropbox to same size as Mediabox
    • Seems to be what you want, but may not always be the case
    • Seems to be the only solution for files with Cropboxes defined
  • Resize Cropboxes to custom size
    • Allows to set different Mediabox / Cropbox sizes
    • This may be important for printing jobs with bleeds

Options for now

  • Add a parameter to set a custom /CropBox size (independent to the /MediaBox size)
  • Have a parameter that sets the size of the /CropBox the same as the /MediaBox

Detecting Cropboxes would be nice, but the complexity grows a lot. Detecting is the first problem, since it may not always work (seems to be exactly the same as Mediabox detection). There is also no clear definition on what default behaviour should be used on each case, since it will always depend on what the user actually wants.
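
As a rough illustration of the detection problem, a naive check could just count the lines that mention a box. This is a sketch under the same assumptions as the strings/grep lookup used for the Mediabox; count_box is a made-up name, and it will miss boxes stored inside compressed streams.

```shell
# count_box: hypothetical helper, a naive detector only.
# $1 = box name (e.g. CropBox), $2 = PDF file.
# Prints how many lines of printable text mention "/<name>".
count_box() {
    strings "$2" | grep -c -e "/$1"
}

# Usage sketch (file name is a placeholder):
#   if [ "$(count_box CropBox input.pdf)" -gt 0 ]; then
#       echo "input has cropboxes defined"
#   fi
```

Note that grep -c counts matching lines, not occurrences, so a page dictionary written on a single line counts once even if it holds several boxes.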

Still digging and thinking here.
I will add the readlink bash implementation to fix the MacOSX installer while I think a bit more.

@Cellomaster87
Author

Good morning Gus!
I have read now everything and those three articles you shared are just amazing!
Before even trying any of the tools suggested, I followed the instructions and opened one of the resized files in Acrobat CC 2020, then in Preferences activated the display of all boxes (which showed the media box in blue around the page). Then I went to Edit > Crop Pages, double-clicked on a page, and noticed this:
Screenshot 2020-03-31 at 10 59 46
I therefore clicked on "Set to zero" under "Margin control" and this fixed the page for good.

Now, as doing this for every document I export would be as slow as manually converting them one by one, I wonder if the culprit may be the PDF export engine of Apple Pages. In macOS Finder, the info of the combined file shows Pages in the "Created by" field, which is strange as the combining software was Acrobat.
My suspicion is that the difference in encoding software between Pages and Sibelius causes issues. In Pages the encoding software is "macOS Version 10.15.3 (Build 19D76) Quartz PDFContext", while in Sibelius it is "Qt 5.12.5".
While pdfScale has no issue converting each one of them separately, it probably (and rightfully) gets confused when it has to convert a PDF made up of 2A+B+2A, where A is a PDF created by Pages and B is one created by Sibelius.

From the screenshot you can see that margins were added to the Crop Box.
Would it be possible to set its margins to 0, simply? Maybe putting a condition that would check if the encoding software between the components is different (don't know if this is at all possible).
I think setting the size of one to the other would not be desirable as, as explained in the articles, sometimes one would want the media box to be bigger. We just want the CropBox to be itself, unmodified, therefore removing the margins would be enough I guess.

What do you think?

@tavinus
Owner

tavinus commented Mar 31, 2020

Hi, things got a bit hectic yesterday, so I could not finish anything.
Also, the readlink -f implementation for non-GNU systems is being a pain to test and implement (this will fix the installer/upgrader on MacOSX, Solaris, etc).

Anyways, for your specific use case I already have a solution (which will be to reset all cropboxes to the same size as the Mediabox by issuing an execution flag). On top of that I will also add the option to manually change the Cropbox to a custom size.

It will probably be something like

--cropbox a4
--cropbox 'custom mm 200x200'
--cropbox fullsize (or any other appropriate name)

So it will probably be similar to the regular page size definition.

@Cellomaster87
Copy link
Author

Thank you so much!
I really appreciate all this!
Looking forward to seeing this in action!

@tavinus
Owner

tavinus commented Apr 3, 2020

Hi,
I have pushed a new Branch so we can test it before I merge and release it.

https://github.com/tavinus/pdfScale/tree/v2.5

Can you please try it and let me know?
I had trouble testing on MacOSX (I currently only have a VM and it does not work very well).

Things to note

  • The Installer and the Upgrader should now work on MacOSX (can you please test them for me?)
    • Please note that the Upgrader will end up failing because it will download the old version from the master branch (for now). But if the installer of the new version works, the upgrader will also work after merging it to the master branch.
  • The GREP page size detection should not have any errors with your file Two.pdf anymore (uses strings)
  • -c | --cropbox option added

Here is the --help explanation for the --cropbox parameter

 -c, --cropbox <paper>
             Resets Cropboxes on all pages to a specific paper size
             Only applies to resize mode
             <paper> can be: full | fullsize - Uses the same size as the main paper/mediabox
                             custom          - Define a custom cropbox size in inches, mm or points
                             std paper name  - Uses a paper size name (eg. a4, letter, etc)

So in your case you should just add -c full to your pdfScale call and you should be fine.
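
Applied to the function from the original script, that change might look like the sketch below. The resize and scale values are the ones from the first post; PDFSCALE_BIN is a made-up variable added here only so the path can be overridden, and the position of -c among the other options is an assumption.

```shell
# Sketch only. PDFSCALE_BIN is a hypothetical override; by default it
# points at the path used in the original script.
PDFSCALE_BIN="${PDFSCALE_BIN:-/usr/local/bin/pdfscale.sh}"

# $1 = input PDF, $2 = output PDF
# Resizes to the custom paper, resets every CropBox to the new paper
# size (-c full), then scales the content to 98.5 percent.
pdf_rescale() {
    "$PDFSCALE_BIN" -r 'custom pt 657.64 864.57' -c full -s 0.985 "$1" "$2"
}
```

The rest of the Automator script can stay unchanged, since only the pdfScale call itself needs the extra option.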


EDIT
v2.5.2 fixes a problem with curl redirects that was breaking upgrades.

@Cellomaster87
Author

The --upgrade is still not working for me on Catalina. Here is the Terminal output:

pdfScale.sh --upgrade
readlink: illegal option -- f
usage: readlink [-n] [file ...]
pdfScale.sh v2.4.9 - Self Upgrade

Preparing download to temp folder
 > /tmp/pdfScale_20200404-160012.tar.gz
Downloading file with curl

Extracting compressed file
Extraction error.

Cleaning up downloaded files from /tmp
 > /tmp/pdfScale_20200404-160012.tar.gz > Ok
 > no temporary master folder was found to remove

I have installed it manually and the version is now correctly 2.5.2.
I ran the script adding -c full and it works perfectly.
Do you have any suggestion on how to make this script automatically be applied on the content of a folder of PDFs and possibly adding a suffix to the name of the output? Or is the original Automator I pasted in the beginning the best thing?

Thank you so much for this!
I stay at your disposal for testing the upgrading issue on macOS.

@tavinus
Owner

tavinus commented Apr 4, 2020

The --upgrade is still not working for me on Catalina. Here is the Terminal output:

pdfScale.sh v2.4.9 - Self Upgrade

^ This was running 2.4.9, so it is normal for it not to work. Only 2.5.2 will run the upgrade properly on Macs (even though it will offer the older version, with a warning).

Proceeding with the upgrade will downgrade (until I merge with the master branch).

I was able to test on a Yosemite VM (which is when I found the problem with curl that was patched on 2.5.2).

I ran the script adding -c full and it works perfectly.
Do you have any suggestion on how to make this script automatically be applied on the content of a folder of PDFs and possibly adding a suffix to the name of the output? Or is the original Automator I pasted in the beginning the best thing?

Your automator script seems fine for what you need, and I can't think of any reason for it not to work with the new version.
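
If the extglob expression in that script is hard to read, the output name with the suffix can also be built with basename. This is just an alternative sketch with made-up helper names (out_name, list_outputs). Unlike the original pattern, it strips only the last extension, so "score.v2.pdf" becomes "score.v2_Quarto.pdf".

```shell
# out_name: hypothetical helper.
# $1 = input PDF path, $2 = output folder, $3 = suffix (e.g. _Quarto.pdf)
# Prints the output path for that input file.
out_name() {
    local base
    base="$(basename "$1")"
    printf '%s/%s%s\n' "$2" "${base%.*}" "$3"
}

# list_outputs: hypothetical helper.
# $1 = input folder, $2 = output folder, $3 = suffix
# Prints one output path per PDF found in the input folder.
list_outputs() {
    local f
    for f in "$1"/*.pdf; do
        # an unmatched glob stays literal, so skip anything that does not exist
        [ -e "$f" ] || continue
        out_name "$f" "$2" "$3"
    done
}
```

Inside the loop, the printed path would be passed as the second argument of the rescale function instead of being printed.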

Would be nice to have batch processing for folders included into pdfScale, but I am not sure I will be able to do it right now.

I will probably merge with master today, so everything will be easier to test and the upgrade will not downgrade anymore.

@tavinus
Owner

tavinus commented Apr 4, 2020

To install using the v2.5 branch you need to adjust the URLs
changing master to v2.5 >

# Normal install with prompts
curl -s -o /tmp/pdfScale.sh 'https://raw.githubusercontent.com/tavinus/pdfScale/v2.5/pdfScale.sh' && bash /tmp/pdfScale.sh --install

# Automated install with --assume-yes
curl -s -o /tmp/pdfScale.sh 'https://raw.githubusercontent.com/tavinus/pdfScale/v2.5/pdfScale.sh' && bash /tmp/pdfScale.sh --install --assume-yes

# To ignore SSL, use --insecure
curl --insecure -s -o /tmp/pdfScale.sh 'https://raw.githubusercontent.com/tavinus/pdfScale/v2.5/pdfScale.sh' && bash /tmp/pdfScale.sh --install

@fabern

fabern commented Apr 4, 2020

I believe I had a similar problem resizing PDFs in letter format to A4. Sorry if this is hijacking this thread. Just wanted to give feedback that using --cropbox A4 works marvellously for me.

Here is an example of a scientific article in letter format: https://www.hydrol-earth-syst-sci.net/23/303/2019/hess-23-303-2019.pdf

The standard command doesn't yield the desired result. Although the dimensions of the Media and Crop Box differ somewhat from the original file, it still shows the PDF in letter format:
pdfscale -v -r A4 Downloads/hess-23-303-2019.pdf

Using the --cropbox argument effectively modifies how the pdf is shown:

bernharf@bernstein:~|⇒  pdfscale -v -r A4 --cropbox A4 Downloads/hess-23-303-2019.pdf
pdfscale v2.5.3 - Verbose Execution
   Single Task: Resize PDF Paper
       Dry-Run: FALSE
    Input File: Downloads/hess-23-303-2019.pdf
   Output File: Downloads/hess-23-303-2019.A4.pdf
 Get Page Size: Adaptive Enabled
        Method: Grep
  Source Width: 612 postscript-points
 Source Height: 802 postscript-points
    Print Mode: Print ( auto/empty )
  Scale Factor: Disabled (resize only)
   Fit To Page: Enabled (default)
   Auto Rotate: PageByPage
   Flip Detect: No change needed
  Run Resizing: A4 ( 595 x 842 ) pts
 Cropbox Reset: A4 ( 595 x 842 ) pts
  Final Status: File created successfully

@tavinus
Owner

tavinus commented Apr 5, 2020

Not hijacking at all. Thanks for the feedback @fabern

From what I tested, most problematic PDFs had different cropbox sizes on different pages (some were very close but still a bit different).

If you want the cropbox reset to the SAME size as you are resizing, -c full should be the best option (will use the same size as the main resize in any case without the need to specifically set a size).

@tavinus tavinus mentioned this issue Apr 5, 2020
4 tasks
@tavinus
Owner

tavinus commented Apr 5, 2020

Ok,

v2.5.3 was merged and released and the v2.5 branch was deleted.
Feel free to report any problems.

Cheers 🍻
Gus


3 participants