Creating and readding ggplot plots is very slow #1258

Closed
2 of 3 tasks
jcpsantiago opened this issue May 18, 2020 · 2 comments

jcpsantiago commented May 18, 2020

Prework

  • Read and abide by drake's code of conduct.
  • Search for duplicates among the existing issues, both open and closed.
  • Advanced users: verify that the bottleneck still persists in the current development version (i.e. remotes::install_github("ropensci/drake")) and mention the SHA-1 hash of the Git commit you install.

Description

Saving ggplot plots as targets and reading them back with readd() is very slow (~7 min, versus ~1 s when the same code runs directly in the console) when the underlying data.frame has 2e5 rows and 200 columns. I'm using r_make() to run the plan.

Reproducible example

The density_plot function below is exactly what I'm using in my code.
Here is my actual plan: https://gist.github.com/jcpsantiago/e119a53199379a438c14e1c33f651b93

The reprex below is also slow, but it still finishes in under 1 min, so it does not fully reproduce the problem.
In my actual plan, the last step (rendering the report) is especially slow. The whole plan (the gist above) took ~2.5 h, while running the same code in a notebook would need around 15-20 min.

library(drake)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)

density_plot <- function(df, xvar, title, xlab, grouping) {
  x_quo <- enquo(xvar)
  group_quo <- enquo(grouping)

  ggplot(df, aes(x = !!x_quo, color = !!group_quo, fill = !!group_quo)) +
    geom_density(alpha = 0.7) +
    scale_colour_viridis_d() +
    scale_fill_viridis_d() +
    labs(
      title = title,
      x = xlab,
      subtitle = paste("Calculated on:", Sys.time())
    )
}

my_plan <- drake_plan(
  df = data.frame(
    group = c(rep("good", 1e5), rep("bad", 1e5)),
    values = rnorm(2e5),
    replicate(200, sample(0:1, 1000, rep = TRUE))
  ),
  my_plot = density_plot(df, values, "My plot", "values", group)
)

make(my_plan)
#> ▶ target df
#> ▶ target my_plot

Created on 2020-05-18 by the reprex package (v0.3.0)

Session info
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.0 (2020-04-24)
#>  os       macOS Catalina 10.15.4      
#>  system   x86_64, darwin17.0          
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       Europe/Berlin               
#>  date     2020-05-18                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date       lib source        
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.0)
#>  backports     1.1.6   2020-04-05 [1] CRAN (R 4.0.0)
#>  base64url     1.4     2018-05-14 [1] CRAN (R 4.0.0)
#>  callr         3.4.3   2020-03-28 [1] CRAN (R 4.0.0)
#>  cli           2.0.2   2020-02-28 [1] CRAN (R 4.0.0)
#>  colorspace    1.4-1   2019-03-18 [1] CRAN (R 4.0.0)
#>  crayon        1.3.4   2017-09-16 [1] CRAN (R 4.0.0)
#>  desc          1.2.0   2018-05-01 [1] CRAN (R 4.0.0)
#>  devtools      2.3.0   2020-04-10 [1] CRAN (R 4.0.0)
#>  digest        0.6.25  2020-02-23 [1] CRAN (R 4.0.0)
#>  dplyr       * 0.8.5   2020-03-07 [1] CRAN (R 4.0.0)
#>  drake       * 7.12.0  2020-03-25 [1] CRAN (R 4.0.0)
#>  ellipsis      0.3.0   2019-09-20 [1] CRAN (R 4.0.0)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.0)
#>  fansi         0.4.1   2020-01-08 [1] CRAN (R 4.0.0)
#>  filelock      1.0.2   2018-10-05 [1] CRAN (R 4.0.0)
#>  fs            1.4.1   2020-04-04 [1] CRAN (R 4.0.0)
#>  ggplot2     * 3.3.0   2020-03-05 [1] CRAN (R 4.0.0)
#>  glue          1.4.0   2020-04-03 [1] CRAN (R 4.0.0)
#>  gtable        0.3.0   2019-03-25 [1] CRAN (R 4.0.0)
#>  highr         0.8     2019-03-20 [1] CRAN (R 4.0.0)
#>  hms           0.5.3   2020-01-08 [1] CRAN (R 4.0.0)
#>  htmltools     0.4.0   2019-10-04 [1] CRAN (R 4.0.0)
#>  igraph        1.2.5   2020-03-19 [1] CRAN (R 4.0.0)
#>  knitr         1.28    2020-02-06 [1] CRAN (R 4.0.0)
#>  lifecycle     0.2.0   2020-03-06 [1] CRAN (R 4.0.0)
#>  magrittr      1.5     2014-11-22 [1] CRAN (R 4.0.0)
#>  memoise       1.1.0   2017-04-21 [1] CRAN (R 4.0.0)
#>  munsell       0.5.0   2018-06-12 [1] CRAN (R 4.0.0)
#>  pillar        1.4.4   2020-05-05 [1] CRAN (R 4.0.0)
#>  pkgbuild      1.0.8   2020-05-07 [1] CRAN (R 4.0.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.0)
#>  pkgload       1.0.2   2018-10-29 [1] CRAN (R 4.0.0)
#>  prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.0.0)
#>  processx      3.4.2   2020-02-09 [1] CRAN (R 4.0.0)
#>  progress      1.2.2   2019-05-16 [1] CRAN (R 4.0.0)
#>  ps            1.3.3   2020-05-08 [1] CRAN (R 4.0.0)
#>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.0.0)
#>  R6            2.4.1   2019-11-12 [1] CRAN (R 4.0.0)
#>  Rcpp          1.0.4.6 2020-04-09 [1] CRAN (R 4.0.0)
#>  remotes       2.1.1   2020-02-15 [1] CRAN (R 4.0.0)
#>  rlang         0.4.6   2020-05-02 [1] CRAN (R 4.0.0)
#>  rmarkdown     2.1     2020-01-20 [1] CRAN (R 4.0.0)
#>  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 4.0.0)
#>  scales        1.1.1   2020-05-11 [1] CRAN (R 4.0.0)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.0)
#>  storr         1.2.1   2018-10-18 [1] CRAN (R 4.0.0)
#>  stringi       1.4.6   2020-02-17 [1] CRAN (R 4.0.0)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.0)
#>  testthat      2.3.2   2020-03-02 [1] CRAN (R 4.0.0)
#>  tibble        3.0.1   2020-04-20 [1] CRAN (R 4.0.0)
#>  tidyselect    1.0.0   2020-01-27 [1] CRAN (R 4.0.0)
#>  txtq          0.2.0   2019-10-15 [1] CRAN (R 4.0.0)
#>  usethis       1.6.1   2020-04-29 [1] CRAN (R 4.0.0)
#>  vctrs         0.2.4   2020-03-10 [1] CRAN (R 4.0.0)
#>  withr         2.2.0   2020-04-20 [1] CRAN (R 4.0.0)
#>  xfun          0.13    2020-04-13 [1] CRAN (R 4.0.0)
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.0)
#> 
#> [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library

Benchmarks

This is the result of pprof(the_plan) with only one plot out of date

[Image: profiler screenshot]

wlandau (Collaborator) commented May 18, 2020

This is an instance of #882. Unfortunately, objects like ggplots, lms, and glms have internal environments that pack a lot of unnecessary data. This is not something drake can solve, but there are ways to work around it. For lms, the biglm package lightens the load. For ggplots, the best advice I can give is to either (1) downsize the data beforehand, or (2) save the ggplot to an image file. For (2), you can make drake watch the image file for changes using file_out(), or alternatively make the target a dynamic file.

library(drake)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)
library(webshot)

density_plot <- function(df, xvar, title, xlab, grouping, file) {
  x_quo <- enquo(xvar)
  group_quo <- enquo(grouping)
  out <- ggplot(df, aes(x = !!x_quo, color = !!group_quo, fill = !!group_quo)) +
    geom_density(alpha = 0.7) +
    scale_colour_viridis_d() +
    scale_fill_viridis_d() +
    labs(
      title = title,
      x = xlab,
      subtitle = paste("Calculated on:", Sys.time())
    )
  ggsave(file, out)
  invisible()
}

my_plan <- drake_plan(
  df = data.frame(
    group = c(rep("good", 1e5), rep("bad", 1e5)),
    values = rnorm(2e5),
    replicate(200, sample(0:1, 1000, rep = TRUE))
  ),
  my_plot = density_plot(
    df, values,
    "My plot",
    "values",
    group,
    file_out("my_plot.png")
  )
)

make(my_plan)
#> ▶ target df
#> ▶ target my_plot
#> Saving 7 x 5 in image

# Runs more quickly.
build_times()
#> # A tibble: 21 x 4
#>    target                 elapsed    user       system    
#>    <chr>                  <Duration> <Duration> <Duration>
#>  1 coef_regression1_large 0.009s     0.005s     0.002s    
#>  2 coef_regression1_small 0.012s     0.004s     0.002s    
#>  3 coef_regression2_large 0.014s     0.003s     0.003s    
#>  4 coef_regression2_small 0.02s      0.004s     0.001s    
#>  5 df                     7.334s     7.062s     0.223s    
#>  6 large                  0.012s     0.007s     0.002s    
#>  7 my_plot                0.794s     0.62s      0.093s    
#>  8 regression1_large      0.012s     0.006s     0.003s    
#>  9 regression1_small      0.03s      0.006s     0.002s    
#> 10 regression2_large      0.009s     0.004s     0.002s    
#> # … with 11 more rows

# We do not return a value.
readd(my_plot)
#> NULL

# But we still get a plot.
webshot("my_plot.png")

Created on 2020-05-18 by the reprex package (v0.3.0)
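
The "dynamic file" alternative mentioned in the comment above is not shown in the reprex. A minimal sketch of what it might look like follows, assuming a drake version that supports target(format = "file"). The helper name save_density_plot and the smaller data frame are hypothetical and chosen only to keep the example quick; the key point is that the command returns the file path instead of the ggplot object.

```r
library(drake)
library(ggplot2)

# Hypothetical helper: saves the plot and returns the file path.
# A dynamic file target is expected to return the path(s) it wrote.
save_density_plot <- function(df, file) {
  p <- ggplot(df, aes(x = values, fill = group)) +
    geom_density(alpha = 0.7)
  ggsave(file, p, width = 7, height = 5)
  file
}

plan <- drake_plan(
  # Small illustrative data set, not the one from the issue.
  df = data.frame(
    group = rep(c("good", "bad"), each = 1e3),
    values = rnorm(2e3)
  ),
  # format = "file" asks drake to track the returned path as a file
  # rather than caching a (potentially large) ggplot object.
  my_plot = target(
    save_density_plot(df, "my_plot.png"),
    format = "file"
  )
)

make(plan)

# The cached value is the path, not a plot object.
plot_path <- readd(my_plot)
```

Compared with file_out(), this keeps the file name out of the plan's static analysis and lets the function decide which path to return; which of the two is preferable depends on the workflow.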

wlandau closed this as completed May 18, 2020
jcpsantiago (Author) commented

Thanks! For future reference:

  • Selecting only the columns I need before plotting decreased build time considerably.
  • In the end I decided not to save the plots as targets at all. Instead, I render them directly in the Rmd report, which is much faster.
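
To illustrate the first point, here is a rough sketch (not taken from the thread) that compares how large a ggplot object is once serialized, with and without the unused columns. Serialized size is used here only as a proxy for the work involved in caching a target. The data dimensions and the helper names make_plot and serialized_size are assumptions made for illustration.

```r
library(ggplot2)

set.seed(1)
n <- 2e4

# A wide data frame: two columns used by the plot plus 50 unused ones.
wide <- data.frame(
  group = rep(c("good", "bad"), each = n / 2),
  values = rnorm(n),
  matrix(sample(0:1, n * 50, replace = TRUE), nrow = n)
)

# Hypothetical helper: builds the same density plot from any data frame
# that has `values` and `group` columns.
make_plot <- function(df) {
  ggplot(df, aes(x = values, fill = group)) +
    geom_density(alpha = 0.7)
}

# Hypothetical helper: number of bytes needed to serialize an object.
serialized_size <- function(x) length(serialize(x, NULL))

p_wide <- make_plot(wide)
p_narrow <- make_plot(wide[, c("group", "values")])

# The plot object carries its whole input data frame, so dropping the
# unused columns first gives a smaller object to store and read back.
size_wide <- serialized_size(p_wide)
size_narrow <- serialized_size(p_narrow)
print(c(wide = size_wide, narrow = size_narrow))
```

The rendered plots are identical in both cases; only the amount of data carried inside the plot object differs.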
