Creating and readding ggplot plots is very slow #1258

jcpsantiago opened this issue May 18, 2020 · 2 comments


jcpsantiago commented May 18, 2020


  • Read and abide by drake's code of conduct.
  • Search for duplicates among the existing issues, both open and closed.
  • Advanced users: verify that the bottleneck still persists in the current development version (i.e. remotes::install_github("ropensci/drake")) and mention the SHA-1 hash of the Git commit you install.


Saving and readding ggplot plots is very slow (~7min vs 1s running directly from the console) when the underlying data.frame has 2e5 rows and 200 cols. I'm using r_make() to run the plan.

Reproducible example

The density_plot function below is exactly what I'm using in my code.
Here is my actual plan:

The reprex slows down, but it still takes <1min, so something is missing.
The last step, rendering the report, is especially slow. The whole plan took ~2.5h, while running the same code in a notebook would need around 15-20min (talking about the gist above).

library(drake)
library(ggplot2)
library(dplyr)
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>     filter, lag
#> The following objects are masked from 'package:base':
#>     intersect, setdiff, setequal, union

density_plot <- function(df, xvar, title, xlab, grouping) {
  # Capture the bare column names so they can be unquoted inside aes().
  x_quo <- enquo(xvar)
  group_quo <- enquo(grouping)

  ggplot(df, aes(x = !!x_quo, color = !!group_quo, fill = !!group_quo)) +
    geom_density(alpha = 0.7) +
    scale_colour_viridis_d() +
    scale_fill_viridis_d() +
    labs(
      title = title,
      x = xlab,
      subtitle = paste("Calculated on:", Sys.time())
    )
}

my_plan <- drake_plan(
  df = data.frame(
    group = c(rep("good", 1e5), rep("bad", 1e5)),
    values = rnorm(2e5),
    replicate(200, sample(0:1, 1000, rep = TRUE))
  ),
  my_plot = density_plot(df, values, "My plot", "values", group)
)

make(my_plan)

#> ▶ target df
#> ▶ target my_plot

Session info
This is the result of pprof(the_plan) with only one plot out of date


wlandau commented May 18, 2020

This is an instance of #882. Unfortunately, objects like ggplots, lms, and glms have internal environments that pack a lot of unnecessary data. This is not something drake can solve, but there are ways to work around it. For lms, the biglm package lightens the load. For ggplots, the best advice I can give is to either (1) downsize the data beforehand, or (2) save the ggplot to an image file. For (2), you can make drake watch the image file for changes using file_out(). (Either that or make it a dynamic file.)

density_plot <- function(df, xvar, title, xlab, grouping, file) {
  x_quo <- enquo(xvar)
  group_quo <- enquo(grouping)
  out <- ggplot(df, aes(x = !!x_quo, color = !!group_quo, fill = !!group_quo)) +
    geom_density(alpha = 0.7) +
    scale_colour_viridis_d() +
    scale_fill_viridis_d() +
    labs(
      title = title,
      x = xlab,
      subtitle = paste("Calculated on:", Sys.time())
    )
  # Write the plot to disk instead of returning the heavy ggplot object.
  ggsave(file, out)
}

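my_plan <- drake_plan(
  df = data.frame(
    group = c(rep("good", 1e5), rep("bad", 1e5)),
    values = rnorm(2e5),
    replicate(200, sample(0:1, 1000, rep = TRUE))
  ),
  my_plot = density_plot(
    df, values,
    "My plot",
    "values",
    group,
    file_out("my_plot.png") # file name is an assumption; the original was cut off here
  )
)

make(my_plan)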

#> ▶ target df
#> ▶ target my_plot
#> Saving 7 x 5 in image

# Runs more quickly.
build_times()
#> # A tibble: 21 x 4
#>    target                 elapsed    user       system    
#>    <chr>                  <Duration> <Duration> <Duration>
#>  1 coef_regression1_large 0.009s     0.005s     0.002s    
#>  2 coef_regression1_small 0.012s     0.004s     0.002s    
#>  3 coef_regression2_large 0.014s     0.003s     0.003s    
#>  4 coef_regression2_small 0.02s      0.004s     0.001s    
#>  5 df                     7.334s     7.062s     0.223s    
#>  6 large                  0.012s     0.007s     0.002s    
#>  7 my_plot                0.794s     0.62s      0.093s    
#>  8 regression1_large      0.012s     0.006s     0.003s    
#>  9 regression1_small      0.03s      0.006s     0.002s    
#> 10 regression2_large      0.009s     0.004s     0.002s    
#> # … with 11 more rows

# We do not return a value.

# But we still get a plot.
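
The "dynamic file" alternative mentioned above could look roughly like the sketch below. It assumes a drake version that supports target(format = "file"); save_density() and the file name my_plot.png are hypothetical names used only for illustration, and the data is a small stand-in rather than the 2e5-row data frame from the reprex.

```r
library(drake)
library(ggplot2)

# Hypothetical helper: draws the density plot, writes it to `file`,
# and returns the path. A dynamic file target must return the path(s)
# of the file(s) it wrote so drake can track them.
save_density <- function(df, file) {
  p <- ggplot(df, aes(x = values, fill = group)) +
    geom_density(alpha = 0.7)
  ggsave(file, p, width = 7, height = 5)
  file
}

plan <- drake_plan(
  # Small stand-in data so the sketch runs quickly.
  df = data.frame(
    group = rep(c("good", "bad"), each = 500),
    values = rnorm(1000)
  ),
  # format = "file" marks the return value as a dynamic file.
  my_plot = target(save_density(df, "my_plot.png"), format = "file")
)

make(plan)
```

With this setup the cache only stores the file path and a hash of the file, not the ggplot object itself, so readd(my_plot) should return the path rather than a plot.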

wlandau closed this as completed May 18, 2020
thanks! for future reference:

  • selecting only the columns I need decreased build time considerably
  • In the end I decided to not save the plots as targets and render them in the Rmd report, which is much faster
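
A minimal sketch of why selecting columns helps, assuming only ggplot2 is installed: a ggplot object keeps the whole data frame it was given, including columns the plot never uses, so that data is stored along with the target. The names wide, narrow, p_wide and p_narrow are made up for this example, and the data is a smaller stand-in for the one in the reprex.

```r
library(ggplot2)

set.seed(1)
n <- 2000

# Two columns used by the plot plus 200 unused ones.
wide <- data.frame(
  group = rep(c("good", "bad"), each = n / 2),
  values = rnorm(n),
  replicate(200, sample(0:1, n, replace = TRUE))
)

# Keep only the columns the plot needs.
narrow <- wide[, c("group", "values")]

p_wide <- ggplot(wide, aes(x = values, fill = group)) +
  geom_density(alpha = 0.7)
p_narrow <- ggplot(narrow, aes(x = values, fill = group)) +
  geom_density(alpha = 0.7)

# The plot object carries every column of its input data.
ncol(p_wide$data)
#> [1] 202
ncol(p_narrow$data)
#> [1] 2
```

Both plots render the same picture, but only the narrow one avoids dragging the 200 unused columns into whatever gets saved.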

