Pdf file too big (maybe compression?) #311

Closed
Keyeoh opened this Issue Jul 6, 2012 · 8 comments

Keyeoh commented Jul 6, 2012

Hi,

I have to plot around 500K points in several figures, and the files knitr is generating are too large. What follows is a small example reproducing the problem:

\documentclass{article}

\begin{document}

<<calc>>=
d <- data.frame(x=rnorm(500000, 0.1))
d$y <- d$x + rnorm(500000)
@

<<myplot, echo=FALSE, results='hide', out.width='\\linewidth'>>=
library(ggplot2)
p <- ggplot(d, aes(x, y)) +
  geom_point() +
  theme_bw()
print(p)
@

\end{document}

The generated PDF file is around 29MB. When I generate it by hand:

pdf('test.pdf')
p <- ggplot(d, aes(x, y)) +
  geom_point() +
  theme_bw()
print(p)
dev.off()

I get a PDF file of 3.4MB. The two files look identical. However, if I re-run that code with the pdf device option 'compress=FALSE', I get a file of ~30MB. That made me suspect that knitr was calling the device with compression disabled, which would be strange, since compression is enabled by default.
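The comparison described above can be reproduced with a small script. This is only a sketch: it assumes base graphics and 50,000 points as a lighter stand-in for the ggplot2 example, and `write_plot` is a hypothetical helper, not part of knitr or grDevices.

```r
# Assumption: base graphics and 50,000 points stand in for the ggplot2 plot
# of 500K points from the report; write_plot() is a hypothetical helper.
set.seed(1)
d <- data.frame(x = rnorm(50000, 0.1))
d$y <- d$x + rnorm(50000)

# Draw the scatter plot to `file`, passing extra arguments on to pdf(),
# and return the size of the finished file in bytes.
write_plot <- function(file, ...) {
  pdf(file, ...)
  plot(d$x, d$y, pch = 16)
  dev.off()              # the file is only complete after the device closes
  file.size(file)
}

sizes <- c(
  compressed   = write_plot(tempfile(fileext = ".pdf"), compress = TRUE),
  uncompressed = write_plot(tempfile(fileext = ".pdf"), compress = FALSE)
)
print(round(sizes / 1024^2, 2))  # sizes in MB
```

The absolute sizes will differ from the ones quoted here, since both the plotting system and the number of points are different; only the gap between the two files is of interest.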

Thank you for your amazing work.

Regards,
Gus.

Contributor

cmmp commented Jul 6, 2012

I don't really see the point of plotting half a million points, but when I have to make plots with many points, raster formats generally work better than vector formats, because a raster image does not have to store every point. If you really must do it for several figures, you may be better off using the png device. It would be nice to know about the compression option, though.

best,
Cássio.

Keyeoh commented Jul 6, 2012

You are right, Cássio :)

I was just playing with the default options ('pdf') when I noticed this behavior, which I thought could be interesting to share. For now, I am getting along with the 'png' without a problem.
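For anyone wanting to do the same, switching knitr to the png device might look like the sketch below. The `dev` and `dpi` chunk options are knitr's own; the value 300 is just an assumed resolution, not something taken from this thread.

```r
# A minimal sketch, meant for the first chunk of the document.
# Assumption: 300 dpi is an arbitrary illustrative resolution.
if (requireNamespace("knitr", quietly = TRUE)) {
  # Use the raster png device for all later chunks instead of pdf
  knitr::opts_chunk$set(dev = "png", dpi = 300)
}
```

The same options can also be set on a single chunk header (for example `dev='png'`) if only the large scatter plots need a raster device.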

Regards,
Gus

Owner

yihui commented Jul 6, 2012

This is really weird. In theory the PDF should be compressed. I'll look into it later. Thanks!

BTW, it works in an interactive R session, though.

Owner

yihui commented Jul 7, 2012

The first impression was so dominant that I spent a few hours on PDF compression, but it turns out not to be a compression problem. You must be using RStudio, which sets pdf.options(useDingbats = FALSE) before it calls knitr, which gives you the big PDF file. You can set pdf.options(useDingbats = TRUE) in your first chunk to get a smaller PDF.
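The effect of this setting can be checked outside RStudio and knitr with a sketch like the following. It assumes base graphics with 50,000 solid points (`pch = 16`) in place of the ggplot2 plot, and `plot_size` is a hypothetical helper introduced only for this illustration.

```r
# Assumption: base graphics with 50,000 solid circles stand in for the
# ggplot2 scatter plot; plot_size() is a hypothetical helper.
set.seed(1)
x <- rnorm(50000, 0.1)
y <- x + rnorm(50000)

# Draw the plot with whatever pdf.options() are currently in effect and
# return the size of the resulting file in bytes.
plot_size <- function() {
  f <- tempfile(fileext = ".pdf")
  pdf(f)
  plot(x, y, pch = 16)
  dev.off()
  file.size(f)
}

old_dingbats <- pdf.options()$useDingbats  # remember the current setting

pdf.options(useDingbats = FALSE)  # the setting attributed to RStudio above
size_off <- plot_size()

pdf.options(useDingbats = TRUE)   # the workaround suggested above
size_on <- plot_size()

pdf.options(useDingbats = old_dingbats)    # restore the previous setting

print(round(c(dingbats_off = size_off, dingbats_on = size_on) / 1024^2, 2))
```

Both files are written with the default compression, so any difference in size comes from how the small circles are drawn rather than from compression.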

Grrrr... this ruined my Friday night. I will complain to RStudio! 😠

Contributor

cmmp commented Jul 7, 2012

In my experience ZapfDingbats is evil. If you are submitting to a journal, it's a problem because the generated PDF doesn't have that font embedded, and some journals require that all figures have their fonts embedded. The IEEE submission system even goes as far as giving you 3 tries to compile your PDF; if it doesn't work, you're screwed.

You can run pdffonts to check the document (this is running the sample Gus provided):

$ pdffonts sample2.pdf 
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
SHXXDP+CMR17                         Type 1            Builtin          yes yes no       5  0
YBUQRJ+CMR12                         Type 1            Builtin          yes yes no       6  0
VZCOTP+CMBX12                        Type 1            Builtin          yes yes no       7  0
RAQEEN+CMR10                         Type 1            Builtin          yes yes no       8  0
FYLRNW+CMTT10                        Type 1            Builtin          yes yes no       9  0
ZapfDingbats                         Type 1            ZapfDingbats     no  no  no      13  0
Helvetica                            Type 1            Custom           no  no  no      14  0

It's best to use the cairo_pdf device, which embeds all fonts:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
SHXXDP+CMR17                         Type 1            Builtin          yes yes no       5  0
YBUQRJ+CMR12                         Type 1            Builtin          yes yes no       6  0
VZCOTP+CMBX12                        Type 1            Builtin          yes yes no       7  0
RAQEEN+CMR10                         Type 1            Builtin          yes yes no       8  0
FYLRNW+CMTT10                        Type 1            Builtin          yes yes no       9  0
UGSFAT+NimbusSanL-Regu               Type 1            Custom           yes yes yes     16  0

The generated PDF in this case is 16M.
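Outside knitr, trying the cairo_pdf device by hand could look like this sketch. It assumes an R build with cairo support, which is why it checks `capabilities("cairo")` first; the 1,000 random points are only placeholder data.

```r
# A minimal sketch, assuming R was built with cairo support.
# The random points are placeholder data, not the data from this issue.
f <- tempfile(fileext = ".pdf")

if (capabilities("cairo")) {
  cairo_pdf(f)                     # cairo-based PDF device from grDevices
  plot(rnorm(1000), rnorm(1000), pch = 16)
  dev.off()
  print(file.size(f))              # size in bytes
} else {
  message("cairo_pdf is not available in this R build")
}
```

In a knitr document the same device can be selected with the chunk option `dev='cairo_pdf'`, and the resulting file can then be inspected with pdffonts as shown above.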

I still believe that for absurdly large numbers of points raster devices are better.

Owner

yihui commented Jul 7, 2012

I completely agree with you that a scatter plot of a large number of points is often not useful. Thanks for the explanation and investigation!

yihui closed this in ca25f66 on Jul 8, 2012

Keyeoh commented Jul 9, 2012

Hi Yihui and Cássio.

Sorry for the late reply. Just wanted to thank you both. Terribly sorry about not mentioning RStudio and ruining Yihui's Friday night.

Very interesting topic, though, the one about ZapfDingbats. I am sure that is a mistake I won't repeat. However, 16M still seems quite large when compared to the 3M when created by hand. RStudio seems to be failing this test.

Thank you.

Regards,
Gus

Owner

yihui commented Jul 9, 2012

Well, I think I can understand why they made the decision to disable Dingbats by default. There isn't much to complain about, and you can just use pdf.options(useDingbats = TRUE) in your setup to revert their setting.

yihui added a commit that referenced this issue Oct 12, 2016
