Keyeoh commented Jul 6, 2012

 Hi, I have to plot around 500K points in several figures, and the files knitr is generating are too large. What follows is a small example reproducing the problem:

    \documentclass{article}
    \begin{document}
    <<>>=
    d <- data.frame(x=rnorm(500000, 0.1))
    d$y <- d$x + rnorm(500000)
    @
    <<>>=
    library(ggplot2)
    p <- ggplot(d, aes(x, y)) + geom_point() + theme_bw()
    print(p)
    @
    \end{document}

 The generated PDF file is around 29MB. When I generate it by hand:

    pdf('test.pdf')
    p <- ggplot(d, aes(x, y)) + geom_point() + theme_bw()
    print(p)
    dev.off()

 I get a PDF file with size 3.4MB. The files look equal. However, if I re-run the last code but using pdf device option 'compress=FALSE', I get a file with size ~30MB. That made me suspect that maybe knitr was calling with compression deactivated, but that seems strange, since that option is enabled by default. Thank you for your amazing work. Regards, Gus.
cmmp commented Jul 6, 2012

 I don't really see the point of plotting half a million points, but when I have to plot many points, raster formats generally work better than vector formats, since a raster image stores pixels rather than every individual point. If you really must do it for several figures, you're probably better off using the png device. It would be nice to know about the compression option, though. Best, Cássio.
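
 For reference, a minimal sketch of the png route in knitr. It assumes knitr's `dev` and `dpi` chunk options; the dpi value of 300 is an arbitrary choice for illustration, not something taken from this thread:

```r
# Put this in the first chunk of the .Rnw file so later chunks inherit it.
library(knitr)

# Render every figure with the png raster device instead of the default pdf.
# dpi = 300 is an assumed value; raise or lower it to trade quality for size.
opts_chunk$set(dev = 'png', dpi = 300)
```

 The same options can also be set on a single chunk header instead, so that only the heavy scatter plots are rasterized.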

Keyeoh commented Jul 6, 2012

 You are right, Cássio :) I was just playing with the default options ('pdf') when I noticed this behavior, which I thought could be interesting to share. For now, I am getting along with the 'png' without a problem. Regards, Gus
yihui commented Jul 6, 2012

 This is really weird. In theory the PDF should be compressed. I'll look into it later. Thanks! BTW, it works in an interactive R session, though.
yihui commented Jul 7, 2012

 The first impression was so dominant that I spent a few hours on PDF compression, but it turns out not to be a compression problem. You must be using RStudio, which sets pdf.options(useDingbats = FALSE) before it calls knitr, which gives you the big PDF file. You can set pdf.options(useDingbats = TRUE) in your first chunk to get a smaller PDF. Grrrr... this ruined my Friday night. I will complain to RStudio! 😠
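
 A rough sketch of how the effect of `useDingbats` on file size could be checked outside knitr and RStudio. It uses base graphics and only 20,000 points to keep it quick, so the absolute sizes will not match the numbers reported above; `write_pdf` is a hypothetical helper defined here only for the comparison:

```r
# Compare PDF sizes with and without Dingbats-rendered circles.
set.seed(1)
n <- 20000
x <- rnorm(n)
y <- x + rnorm(n)

# Hypothetical helper: write one scatter plot to a temporary PDF
# and return the resulting file size in bytes.
write_pdf <- function(use_dingbats) {
  f <- tempfile(fileext = ".pdf")
  pdf(f, useDingbats = use_dingbats)  # compression is left at its default (on)
  plot(x, y, pch = 19)                # small solid circles, similar to geom_point()
  dev.off()
  file.size(f)
}

sizes <- c(dingbats = write_pdf(TRUE), no_dingbats = write_pdf(FALSE))
print(sizes)
```

 With Dingbats enabled, each small circle is written as a single font glyph; without it, each circle is drawn as a path of curves, which takes more bytes per point even after compression.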
cmmp commented Jul 7, 2012

 In my experience ZapfDingbats is evil. Depending on your use case, if you're submitting to a journal, then it's a problem because the generated pdf doesn't have that font embedded. Some journals require that all figures have fonts embedded. The IEEE submission system even goes as far as giving you 3 tries to compile your pdf; if it doesn't work, you're screwed. You can run pdffonts to check the document (this is running the sample Gus provided):

    $ pdffonts sample2.pdf
    name                                 type              encoding         emb sub uni object ID
    ------------------------------------ ----------------- ---------------- --- --- --- ---------
    SHXXDP+CMR17                         Type 1            Builtin          yes yes no       5  0
    YBUQRJ+CMR12                         Type 1            Builtin          yes yes no       6  0
    VZCOTP+CMBX12                        Type 1            Builtin          yes yes no       7  0
    RAQEEN+CMR10                         Type 1            Builtin          yes yes no       8  0
    FYLRNW+CMTT10                        Type 1            Builtin          yes yes no       9  0
    ZapfDingbats                         Type 1            ZapfDingbats     no  no  no      13  0
    Helvetica                            Type 1            Custom           no  no  no      14  0

 It's best to use the cairo_pdf device, which embeds all fonts:

    name                                 type              encoding         emb sub uni object ID
    ------------------------------------ ----------------- ---------------- --- --- --- ---------
    SHXXDP+CMR17                         Type 1            Builtin          yes yes no       5  0
    YBUQRJ+CMR12                         Type 1            Builtin          yes yes no       6  0
    VZCOTP+CMBX12                        Type 1            Builtin          yes yes no       7  0
    RAQEEN+CMR10                         Type 1            Builtin          yes yes no       8  0
    FYLRNW+CMTT10                        Type 1            Builtin          yes yes no       9  0
    UGSFAT+NimbusSanL-Regu               Type 1            Custom           yes yes yes     16  0

 The generated pdf in this case is 16M. I still believe that for absurdly large numbers of points raster devices are better.
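
 A small sketch of producing a figure with the cairo_pdf device directly from R, so the result can be inspected with pdffonts. It assumes an R build with cairo support, which is why the call is guarded by `capabilities("cairo")`; the data and the temporary file name are made up for illustration:

```r
# Write a scatter plot with cairo_pdf, if this R build supports cairo.
set.seed(1)
out <- tempfile(fileext = ".pdf")
has_cairo <- unname(capabilities("cairo"))

if (has_cairo) {
  cairo_pdf(out, width = 7, height = 7)
  plot(rnorm(1000), rnorm(1000), pch = 19)
  dev.off()
  # Print the path so the file can be checked from a shell with pdffonts.
  message("Wrote ", out)
} else {
  message("cairo is not available in this R build; no file written")
}
```

 Running pdffonts on the printed path should then show whether the fonts in that file are embedded.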
yihui commented Jul 7, 2012

 I completely agree with you that a scatter plot of a large number of points is often not useful. Thanks for the explanation and investigation!

Keyeoh commented Jul 9, 2012

 Hi Yihui and Cássio. Sorry for the late reply. Just wanted to thank you both. Terribly sorry about not mentioning RStudio and ruining Yihui's Friday night. Very interesting topic, though, the one about ZapfDingbats. I am sure that is a mistake I won't repeat. However, 16M still seems quite large when compared to the 3M when created by hand. RStudio seems to be failing this test. Thank you. Regards, Gus
yihui commented Jul 9, 2012

 Well, I think I can understand why they made the decision to disable Dingbats by default. There isn't much to complain about, and you can just use pdf.options(useDingbats = TRUE) in your setup to revert their setting.

