Parallelization in tar_render_rep() #36

Closed
7 tasks done
gorgitko opened this issue Apr 6, 2021 · 4 comments

gorgitko commented Apr 6, 2021

Prework

  • Read and agree to the code of conduct and contributing guidelines.
  • Confirm that your issue is most likely a genuine bug in tarchetypes and not a known limitation, a usage error, or a bug in another package that tarchetypes depends on.
  • If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
  • Post a minimal reproducible example like this one so the maintainer can troubleshoot the problems you identify. A reproducible example is:
    • Runnable: post enough R code and data so any onlooker can create the error on their own computer.
    • Minimal: reduce runtime wherever possible and remove complicated details that are irrelevant to the issue at hand.
    • Readable: format your code according to the tidyverse style guide.

Description

When tar_render_rep() runs in parallel (tested with clustermq), every worker uses intermediate knitr files with the same names (e.g. <Rmd_file>.knit.md), so one worker can remove a file before another worker's pandoc has read it.

The reprex calls assignInNamespace("clean_tmpfiles", function() {}, ns = "rmarkdown") as a workaround for rstudio/rmarkdown#1632 (comment).

Reproducible example

targets::tar_dir({
  cat(getwd())

  writeLines(
    con = "tar_render_rep.Rmd",
    text = paste(
      "---",
      "title: 'tar_render_rep'",
      "params:",
      "  par: ''",
      "---",
      "```{r}",
      "cat(params$par)",
      "```",
      sep = "\n"
    )
  )

  targets::tar_script(ask = FALSE, {
    library(tarchetypes)
    library(magrittr)
    library(tidyverse)
    library(clustermq)

    assignInNamespace("clean_tmpfiles", function() {}, ns = "rmarkdown", envir = .GlobalEnv)
    options(clustermq.scheduler = "multicore")

    list(
      tar_target(df, tibble::tibble(par = LETTERS[1:20]) %>% dplyr::mutate(output_file = paste0("tar_render_rep_", 1:n(), ".html"))),
      tar_render_rep(df_rendered, "tar_render_rep.Rmd", params = df)
    )
  })

  targets::tar_make_clustermq(workers = 10)
})
#> /tmp/RtmpO2I6nF/targets_3d5831c9d32e── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
#> ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
#> ✓ tibble  3.1.0     ✓ dplyr   1.0.5
#> ✓ tidyr   1.1.3     ✓ stringr 1.4.0
#> ✓ readr   1.4.0     ✓ forcats 0.5.1
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> x tidyr::extract()   masks magrittr::extract()
#> x dplyr::filter()    masks stats::filter()
#> x dplyr::lag()       masks stats::lag()
#> x purrr::set_names() masks magrittr::set_names()
#> * Option 'clustermq.scheduler' not set, defaulting to ‘LOCAL’
#> --- see: https://mschubert.github.io/clustermq/articles/userguide.html#configuration
#> ● run target df
#> ● run target df_rendered_params
#> ● run branch df_rendered_11894c68
#> ● run branch df_rendered_efc2f5be
#> ● run branch df_rendered_f06ee73c
#> ● run branch df_rendered_46ec1926
#> ● run branch df_rendered_1a3410ff
#> ● run branch df_rendered_fd246783
#> ● run branch df_rendered_915e8a36
#> ● run branch df_rendered_4b0aef2c
#> [WARNING] This document format requires a nonempty <title> element.
#>   Defaulting to 'tar_render_rep.utf8' as the title.
#>   To specify a title, use 'title' in metadata or --metadata title="...".
#> [[[WWWAARARNRNINIINNGNG]G]  ] TTThhhiiisss   dddooocccuumumemenetnn tt fo frfmormaotr reqaumtiart ers  eraqe uqinuroienrsee msap  tany o nn<oetnmieptmlptetyy > < <tetilitetlmleee>n> t e.el
#> leem me ennDtte..f
#> 
#> a  u  lDtDieenffgaa uutlltot ii'nnggt  attro_o r 'e'tntadrae_rrr__rerenendpde.erur_t_rrfee8p'p. u.atusft 8ft'8 h'a esa  stt ihttehl eet .it
#> til  To specifyt ela. titlee.
#> ,
#>   use  '  tTTiot lsepo'e  csiipnfe ycm ieaft yat diaat tltaei ,to lrue s,- eu -m'tseiet 'tt[litWaldAeR'eN 'IiantN aGm ]e t itint lmee=t"aa.dd.a.t"a. 
#> oart a- -omrTe th-ai-dsma ettdaao dctauittmale etn=it"t .l.fe.o="r".m
#> .a.t." .r
#> equires a nonempty <title> element.
#>   Defaulting to 'tar_render_rep.utf8' as the title.
#>   To specify a title, use 'title' in metadata or --metadata title="...".
#> ● run branch df_rendered_52425d62
#> [WARNING] This document format requires a nonempty <title> element.
#>   Defaulting to 'tar_render_rep.utf8' as the title.
#>   To specify a title, use 'title' in metadata or --metadata title="...".
#> [WARNING] This document format requires a nonempty <title> element.
#>   Defaulting to 'tar_render_rep.utf8' as the title.
#>   To specify a title, use 'title' in metadata or --metadata title="...".
#> pandoc: tar_render_rep.utf8.md: openBinaryFile: does not exist (No such file or directory)
#> ● run branch df_rendered_c4c44f78
#> x error branch df_rendered_52425d62
#> Warning in self$crew$finalize() :
#>   Unclean shutdown for PIDs: 16308, 16309, 16310, 16311, 16312, 16315, 16316, 16320, 16324, 16327
#> ● end pipeline
#> Error : cannot open the connection
#> In addition: Warning message:
#> 1 targets produced warnings. Run tar_meta(fields = warnings) for the messages. 
#> [WARNING] This document format requires a nonempty <title> element.
#>   Defaulting to 'tar_render_rep.utf8' as the title.
#>   To specify a title, use 'title' in metadata or --metadata title="...".
#> Error: callr subprocess failed: cannot open the connection
#> Visit https://books.ropensci.org/targets/debugging.html for debugging advice.

Created on 2021-04-06 by the reprex package (v1.0.0)

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.4 (2021-02-15)
#>  os       Gentoo/Linux                
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language en_US.UTF-8                 
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       Europe/Prague               
#>  date     2021-04-06                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  ! package     * version  date       lib source        
#>  P assertthat    0.2.1    2019-03-21 [?] CRAN (R 4.0.2)
#>  P backports     1.2.1    2020-12-09 [?] CRAN (R 4.0.2)
#>    callr         3.6.0    2021-03-28 [1] CRAN (R 4.0.4)
#>  P cli           2.3.1    2021-02-23 [?] CRAN (R 4.0.4)
#>    clustermq     0.8.95.1 2020-07-13 [1] CRAN (R 4.0.4)
#>  P codetools     0.2-18   2020-11-04 [?] CRAN (R 4.0.4)
#>  P crayon        1.4.1    2021-02-08 [?] CRAN (R 4.0.2)
#>  P data.table    1.14.0   2021-02-21 [?] CRAN (R 4.0.4)
#>  P digest        0.6.27   2020-10-24 [?] CRAN (R 4.0.2)
#>  P ellipsis      0.3.1    2020-05-15 [?] CRAN (R 4.0.2)
#>  P evaluate      0.14     2019-05-28 [?] CRAN (R 4.0.2)
#>  P fansi         0.4.2    2021-01-15 [?] CRAN (R 4.0.2)
#>  P fs            1.5.0    2020-07-31 [?] CRAN (R 4.0.2)
#>  P glue          1.4.2    2020-08-27 [?] CRAN (R 4.0.2)
#>  P highr         0.8      2019-03-20 [?] CRAN (R 4.0.2)
#>  P htmltools     0.5.1.1  2021-01-22 [?] CRAN (R 4.0.2)
#>  P igraph        1.2.6    2020-10-06 [?] CRAN (R 4.0.2)
#>  P knitr         1.31     2021-01-27 [?] CRAN (R 4.0.2)
#>  P lifecycle     1.0.0    2021-02-15 [?] CRAN (R 4.0.4)
#>  P magrittr      2.0.1    2020-11-17 [?] CRAN (R 4.0.2)
#>  P pillar        1.5.1    2021-03-05 [?] CRAN (R 4.0.4)
#>  P pkgconfig     2.0.3    2019-09-22 [?] CRAN (R 4.0.2)
#>  P processx      3.5.0    2021-03-23 [?] CRAN (R 4.0.4)
#>  P ps            1.6.0    2021-02-28 [?] CRAN (R 4.0.4)
#>  P purrr         0.3.4    2020-04-17 [?] CRAN (R 4.0.2)
#>  P R6            2.5.0    2020-10-28 [?] CRAN (R 4.0.2)
#>  P Rcpp          1.0.6    2021-01-15 [?] CRAN (R 4.0.2)
#>  P reprex        1.0.0    2021-01-27 [?] CRAN (R 4.0.2)
#>  P rlang         0.4.10   2020-12-30 [?] CRAN (R 4.0.2)
#>  P rmarkdown     2.7      2021-02-19 [?] CRAN (R 4.0.4)
#>  P sessioninfo   1.1.1    2018-11-05 [?] CRAN (R 4.0.2)
#>  P stringi       1.5.3    2020-09-09 [?] CRAN (R 4.0.2)
#>  P stringr       1.4.0    2019-02-10 [?] CRAN (R 4.0.2)
#>    styler        1.4.1    2021-03-30 [1] CRAN (R 4.0.4)
#>    targets       0.3.1    2021-03-28 [1] CRAN (R 4.0.4)
#>  P tibble        3.1.0    2021-02-25 [?] CRAN (R 4.0.4)
#>  P tidyselect    1.1.0    2020-05-11 [?] CRAN (R 4.0.2)
#>  P utf8          1.2.1    2021-03-12 [?] CRAN (R 4.0.4)
#>    vctrs         0.3.7    2021-03-29 [1] CRAN (R 4.0.4)
#>  P withr         2.4.1    2021-01-26 [?] CRAN (R 4.0.2)
#>  P xfun          0.22     2021-03-11 [?] CRAN (R 4.0.4)
#>  P yaml          2.2.1    2020-02-01 [?] CRAN (R 4.0.2)
#> 
#> [1] /mnt/raid/Users/novotnyj/projects/Vomastek/2021_01_canis_rnaseq/basenji/nf-core_reverse/R/renv/library/R-4.0/x86_64-pc-linux-gnu
#> [2] /tmp/RtmpBMvJUA/renv-system-library
#> [3] /usr/lib64/R/library
#> 
#>  P ── Loaded and on-disk path mismatch.

Expected result

A different intermediates_dir should be passed to rmarkdown::render() for each rendering, so that concurrent workers do not remove each other's intermediate knitr files. My suggestion: name each intermediates_dir after the basename of the corresponding output_file, or create randomly named directories.
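The suggestion above could be sketched roughly as follows. This is only an illustration, not the fix that went into tarchetypes: unique_intermediates_dir() and render_isolated() are hypothetical helper names, and the only real API assumed is the intermediates_dir argument of rmarkdown::render().

```r
# Hypothetical helper (not part of tarchetypes or rmarkdown): create a fresh
# directory whose name starts with the base name of the output file.
unique_intermediates_dir <- function(output_file) {
  prefix <- tools::file_path_sans_ext(basename(output_file))
  # tempfile() only returns a path that does not exist yet, so repeated calls
  # for the same output file still give distinct directories.
  path <- tempfile(pattern = paste0(prefix, "_"))
  dir.create(path)
  path
}

# Hypothetical wrapper: render one report with its own intermediates
# directory and remove that directory when rendering ends.
render_isolated <- function(input, output_file, params = list()) {
  dir <- unique_intermediates_dir(output_file)
  on.exit(unlink(dir, recursive = TRUE), add = TRUE)
  rmarkdown::render(
    input = input,
    output_file = output_file,
    params = params,
    intermediates_dir = dir,
    quiet = TRUE
  )
}
```

With the reprex above, a call might look like render_isolated("tar_render_rep.Rmd", "tar_render_rep_1.html", params = list(par = "A")). This only isolates the knitr intermediates; it does not touch the files that rmarkdown itself creates in tempdir().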

Diagnostic information

See Reproducible example


Thanks in advance for looking into this! For now, I have to run the pipeline sequentially with tar_make() to avoid this problem.

Collaborator

wlandau commented Apr 6, 2021

You are right, multiple workers are trying to write to the same temporary directory. We need to set intermediates_dir to a temporary directory unique to each rendering. Should be fixed now for tar_make_clustermq() with options(clustermq.scheduler = "multiprocess"). There is still something funny going on with options(clustermq.scheduler = "multicore") though.

Collaborator

wlandau commented Apr 6, 2021

For reference, sometimes I still get this when I use multicore parallelism (but no errors with multiprocess parallelism):

> tar_make_clustermq(workers = 10)
Loading required package: future
start target report_params
built target report_params
start branch report_05f093ab
start branch report_082d5890
start branch report_b31bb919
start branch report_d0f80795
start branch report_c5cceb83
start branch report_f3ab2a96
start branch report_557b3a5e
start branch report_8f41e77f
start branch report_51a54713
start branch report_5779b1db
pandoc: /var/folders/k3/q1f45fsn4_13jbn0742d4zj40000gn/T//RtmpvbnUEf/rmarkdown-str17b97612e1f10.html: openBinaryFile: does not exist (No such file or directory)
pandoc: /var/folders/k3/q1f45fsn4_13jbn0742d4zj40000gn/T//RtmpvbnUEf/rmarkdown-str17ba155f17ca4.html: openBinaryFile: does not exist (No such file or directory)

After some experimentation, I am still not sure why options(clustermq.scheduler = "multicore") still causes problems, but it could have something to do with https://stackoverflow.com/questions/48161177/r-markdown-openbinaryfile-does-not-exist-no-such-file-or-directory. When I wrote the full path to the R Markdown source, the errors went away. But I would only recommend this as a last resort: collaborative/portable projects need a relative path, or else the target is invalidated whenever the project moves to another file system or computer.

Author

gorgitko commented Apr 7, 2021

Thank you very much for the fix 👍

Anyway, I found some mistakes in my reprex, and I am actually surprised it worked 😄 I fixed the Rmd generation and changed assignInNamespace() to use .GlobalEnv (now you shouldn't get the openBinaryFile: does not exist error you described, see below).

The error you are getting is related to rstudio/rmarkdown#1632 (comment) and can be worked around with a dirty fix: assignInNamespace("clean_tmpfiles", function() {}, ns = "rmarkdown", envir = .GlobalEnv) (as used in the reprex). It's a really old issue, and I keep wondering why nobody at RStudio has looked into it yet. In short, {rmarkdown} writes some temporary files to tempdir() during render() and removes them afterwards with clean_tmpfiles(). That function uses a very general pattern for file removal, so it also removes the temporary files of other concurrent render() calls. This is why you see the error with multicore parallelism, where all workers share the same tempdir; with multiprocess parallelism, each worker has its own tempdir.
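The tempdir difference described above can be checked with a small sketch. It assumes that the multicore scheduler forks the current R session and that the multiprocess scheduler launches fresh R sessions; parallel::mclapply() and plain Rscript calls stand in for the two schedulers here, and fresh_tempdir() is a hypothetical helper. Forking is not available on Windows, so this is Unix-only.

```r
library(parallel)

# Forked workers inherit the parent's session state, including tempdir().
forked <- unlist(mclapply(1:2, function(i) tempdir(), mc.cores = 2))

# Hypothetical helper: ask a freshly started R process for its tempdir().
fresh_tempdir <- function() {
  rscript <- file.path(R.home("bin"), "Rscript")
  out <- system2(
    rscript,
    c("--vanilla", "-e", shQuote("writeLines(tempdir())")),
    stdout = TRUE
  )
  tail(out, 1)
}
fresh <- c(fresh_tempdir(), fresh_tempdir())

# Compare the forked and fresh paths with the parent's own tempdir().
print(forked == tempdir())
print(fresh == tempdir())
```

If the forked workers report the parent's tempdir() while the fresh processes report directories of their own, concurrent render() calls in forked workers would be the ones exposed to the clean_tmpfiles() cleanup.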

Collaborator

wlandau commented Apr 7, 2021

Thanks for filling me in on the rest of the issue. Really helps to understand. I agree, I think the rest is outside the control of tarchetypes and would be best fixed in rstudio/rmarkdown#1632 (comment).
