External packages as dependencies #6
Would it make sense to have a "package import" list as part of the Import step? Have a single target for all the internal functions in addition to all the functions that are actually called in …
I tried the latter at one point, but I decided to keep imports totally separate since v2.0.0. I was thinking of maybe …
I am on the fence about whether to track …
It might be simplest, code-wise, to have the package itself be the import target. This would err on the side of rebuilding too much, though, since an update to a package might not touch the few functions that I care about.
That does seem like the most practical option because it avoids the problem of having a massive import list. We could …
Concerns:

…
The cheap way to get around [2] would be to track the version number (for CRAN) or the commit hash (GitHub). That would offload the hashing work to the maintainers rather than the users. At that point, the main danger would be a user happily hacking away at their local version of a package and not making a commit before re-running make(). That, I feel, is a situation which, at least in the short term, could be dealt with using a section in the caution vignette.
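A rough sketch of what that lookup could look like, assuming we read the metadata that install.packages() and devtools leave in DESCRIPTION (package_fingerprint is a hypothetical name; GithubSHA1 is the field devtools recorded for GitHub installs at the time):

```r
# Hypothetical helper: identify an installed package by its GitHub commit
# hash when available, falling back to the CRAN/Bioconductor version string.
package_fingerprint <- function(pkg) {
  desc <- utils::packageDescription(pkg)
  if (!is.null(desc$GithubSHA1)) {
    desc$GithubSHA1  # devtools writes this field during install_github()
  } else {
    desc$Version
  }
}
```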
That makes sense, and I believe this is what packrat does. For now, we could just assume local packages do not have a CRAN/Bioconductor presence or a GitHub commit hash, and, as you said, put the obvious pitfalls in the caution vignette. I should say that I am holding off on this and other issues to see if my company's open source procedure changes soon. We have a solid draft of the new procedure, and it is en route to final approval. I will meet this afternoon with a gatekeeper from "Quality", who will hopefully give me a better idea of when I can just push to drake without bureaucratic delays.
It is exciting! The gatekeeper I met today was extremely supportive and willing to work fast. There is barely any editing left. On the other hand, many people have to see it and sign off, so I do not know exactly what to expect. I look forward to seeing what you have done for #40. I think that will also help to free up more development without fear of merge conflicts.
Regarding the original issue, tracking whole packages strikes me as a straightforward way to reproducibly track compiled C/C++/Fortran code (for use in …).
I thought more about this, and I am leaning more and more toward tracking the versions/hashes of CRAN/Bioconductor/GitHub/Bitbucket packages. However, we should suppress this feature for versions 4.1.0 and prior. Otherwise, a drake update will trigger rebuilds in everyone's workflows. We're already in major version 4, and I need to be better about respecting back-compatibility. For custom local packages, things get a little trickier. I think we should stick to a simple, straightforward fix that is unlikely to surprise users. Maybe that means we do not track them at all, which would make sense because local packages are an edge case anyway. The user could still recursively walk through package functions by appending them to the execution environment for make():

```r
# Append the functions from local packages to the execution environment
# so that the code analysis walks through them.
# (The original snippet called as.list() on the package names themselves;
# presumably getNamespace() was intended, as below.)
envir <- as.environment(
  c(
    as.list(envir),
    as.list(getNamespace("custom_package_1")),
    as.list(getNamespace("custom_package_2"))
  )
)
```

This hack belongs in the caution vignette anyway. Also, we will need to update …
I wonder if it might be worth it to add a … More relevant, though, it might be best to wrap this up in target/command pairs:

```r
##   target  command
## 1 dplyr   packageVersion('dplyr') == '0.7.2'
## 2 ggplot2 packageVersion('ggplot2') == '2.2.1'
```

I'm still trying to figure out how to make the …
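For illustration, a minimal sketch of those pairs as a plan-like data frame, assuming drake's convention of target and command columns (the packages and versions are just the examples from the table above):

```r
# Hypothetical plan fragment: one target per tracked package,
# whose command changes whenever the installed version changes.
pkg_plan <- data.frame(
  target = c("dplyr", "ggplot2"),
  command = c(
    "packageVersion('dplyr') == '0.7.2'",
    "packageVersion('ggplot2') == '2.2.1'"
  ),
  stringsAsFactors = FALSE
)
```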
Not unrelated: should we also track …
Thankfully, …

```r
load_basic_example()
make(my_plan, verbose = FALSE)
s <- session()
s$otherPkgs$drake$Version
## "4.1.1.9000"
s$R.version$major
## "3"
s$R.version$minor
## "4.0"
```

It is a good thought to consider package dependencies via …
I do like your thinking on this, though. Ultimately, the fix should look this clean and elegant.
Here are some of the changes I am thinking about.

deps()

Current behavior:

```r
head(deps(lm))
## [1] ".getXlevels" "as.vector"   "attr"        "c"           "eval"
## [6] "gettextf"
```

This is already wrong because … Proposed new behavior:

```r
deps(lm)
## "package:stats"
deps(knitr::knit)
## "package:knitr"
```

No back compatibility concerns here. New function …
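One plausible way for the new deps() to resolve a function to its package, sketched with getNamespaceName() (pkg_of is a hypothetical name, not drake's actual implementation):

```r
# Sketch: map a closure to a "package:pkg" import string.
pkg_of <- function(fun) {
  env <- environment(fun)
  if (is.null(env)) {
    return(character(0))  # primitives such as sum() have no enclosing environment
  }
  paste0("package:", getNamespaceName(topenv(env)))
}

pkg_of(lm)          # "package:stats"
pkg_of(knitr::knit) # "package:knitr"
```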
Did a little digging this morning, and we might want to make use of devtools:::local_sha(). I don't have any Bioconductor or Bitbucket repos installed, but it's got comparable performance for packages installed from CRAN or GitHub. Mostly making a note here before I start work on other stuff, but later this week, I hope to look at how to integrate this into drake.

```r
> microbenchmark::microbenchmark(
+   CRAN1 = devtools:::local_sha("ggplot2"),
+   CRAN2 = devtools:::local_sha("drake"),
+   GH1 = devtools:::local_sha("lintr"),
+   GH2 = devtools:::local_sha("hrbrthemes")
+ )
Unit: milliseconds
  expr   min     lq    mean median     uq   max neval
 CRAN1 2.662 2.7565 2.93691 2.8550 3.0195 3.563   100
 CRAN2 2.606 2.7025 2.87879 2.7925 2.9690 4.465   100
   GH1 2.562 2.6680 2.86591 2.7555 2.9815 3.673   100
   GH2 2.607 2.7170 3.00867 2.8620 3.0770 5.770   100
> list(
+   CRAN1 = devtools:::local_sha("ggplot2"),
+   CRAN2 = devtools:::local_sha("drake"),
+   GH1 = devtools:::local_sha("lintr"),
+   GH2 = devtools:::local_sha("hrbrthemes")
+ )
$CRAN1
[1] "2.2.1"

$CRAN2
[1] "4.1.0"

$GH1
[1] "15fd8ee6685248b7f477c949c1e78dc13350cd15"

$GH2
[1] "1fd0301ce07f3e025f1a259008ea74f251f9d48b"
```
Also a note:

```r
> getNamespaceName(environment(dplyr::lag))
name
"dplyr"
> getNamespaceName(environment(stats::lag))
name
"stats"
> getNamespaceName(environment(base::lag)) #Does not exist
Error in get(name, envir = ns, inherits = FALSE) : object 'lag' not found
> getNamespaceName(environment(lag)) #Default Search Path
name
"stats"
> getNamespaceName(environment(mutate)) # Not on Search Path Yet.
Error in environment(mutate) : object 'mutate' not found
> find("lag")
[1] "package:stats"
> find("mutate")
character(0)
> find("dplyr::lag")
character(0)
> find("dplyr::mutate")
character(0)
> suppressPackageStartupMessages(library("dplyr"))
> getNamespaceName(environment(dplyr::lag))
name
"dplyr"
> getNamespaceName(environment(stats::lag))
name
"stats"
> getNamespaceName(environment(lag)) #Default Search Path
name
"dplyr"
> find("lag")
[1] "package:dplyr" "package:stats"
> find("mutate")
[1] "package:dplyr"
> find("dplyr::lag")
character(0)
> find("dplyr::mutate")
character(0)
```
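Wrapping that lookup so it degrades gracefully when a symbol cannot be resolved (a sketch; safe_namespace is a hypothetical helper):

```r
# Return the namespace that owns `name`, or NA if it cannot be resolved.
safe_namespace <- function(name) {
  tryCatch(
    unname(getNamespaceName(environment(get(name)))),
    error = function(e) NA_character_
  )
}

safe_namespace("lag")    # "stats" or "dplyr", depending on the search path
safe_namespace("mutate") # NA until dplyr is loaded
```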
I just submitted an SO post in hopes of finding an already-exported version of …
Looks good. If there aren't any bites on that, the answers to this SO question are informative. The main options being:

…

Neither would add anything to …
I vote for (1) if it comes to that, but it will be difficult to extricate …
Agreed.
Needs fixing, testing, and back-compatibility work.
I did some more work on this, and @jimhester's eapply() one-liner is turning out to be the most practical way of doing things. As much as I would like to hash compiled code, we only reliably have the shared library in the package's file system, and I think it would be unwise to hash that. And hashing a package's datasets would be time-consuming and usually unnecessary. Let's stick to just the functions. I no longer think we need a separate tool. I also noticed that the … At this point, I need to fix some difficult bugs, enforce back compatibility, and see how much useful testing I can add.
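For readers skimming the thread, the one-liner in question is presumably along these lines (the exact srcref-stripping variant appears in the test below):

```r
# Hash every function in a package's namespace in one pass.
digest::digest(eapply(asNamespace("knitr"), FUN = deparse, all.names = TRUE))
```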
Changed the back-compatibility test to use eply::strings. That way, I can be sure that "package:eply" is not an import/dependency node for projects created with early versions of drake.
Correction: I am using … Also, it may not be enough to remove the srcref attribute:

```r
withr::with_dir(new = tempdir(), code = {
  # Remove srcrefs from all the functions,
  # then hash them all together.
  hash_without_srcref <- function(pkg){
    digest::digest(
      eapply(
        asNamespace(pkg),
        function(x) {
          attr(x, "srcref") <- NULL
          x
        },
        all.names = TRUE
      )
    )
  }
  # Install a dummy package in a local library.
  lib <- "local_lib"
  dir.create(lib)
  pkgenv <- new.env()
  pkgenv$newfunction <- function(x){
    x + 1
  }
  withr::with_message_sink(
    new = tempfile(),
    code = {
      package.skeleton(name = "newpkg", environment = pkgenv)
    }
  )
  unlink(file.path("newpkg", "man"), recursive = TRUE)
  install.packages("newpkg", type = "source", repos = NULL,
    lib = lib, quiet = TRUE)
  # Install the same package in a different local library.
  lib2 <- "local_lib2"
  dir.create(lib2)
  install.packages("newpkg", type = "source", repos = NULL,
    lib = lib2, quiet = TRUE)
  # Compute the hash of newpkg in each local library.
  withr::with_libpaths(
    new = c(lib, .libPaths()),
    code = {
      unloadNamespace("newpkg")
      hash1 <- hash_without_srcref("newpkg")
    }
  )
  withr::with_libpaths(
    new = c(lib2, .libPaths()),
    code = {
      unloadNamespace("newpkg")
      hash2 <- hash_without_srcref("newpkg")
    }
  )
  # I want them to be equal, but they are different.
  testthat::expect_equal(hash1, hash2)
})
```
Another concern about this whole issue: even if I deparse all the functions before hashing, some package hashes turn out different on different platforms. Example: knitr, hashed with the following function:

```r
pkg_hash <- function(pkg){
  digest::digest(eapply(asNamespace(pkg),
    FUN = deparse, all.names = TRUE))
}
```

On Windows R-3.4.0:

```r
packageVersion("knitr")
## [1] ‘1.17’
pkg_hash("knitr")
## [1] "70d9000607afbf3b1bf87651d73ce55d"
```

On Red Hat Linux R-3.4.0:

```r
packageVersion("knitr")
## [1] ‘1.17’
pkg_hash("knitr")
## [1] "ae52c9275b69186123ef062f98d42b15"
```

In my opinion, these hashes overreact to differences among installed instances of packages. I think we will have to go with the most transparent thing: just depend on package versions. It's not ideal, but users will know exactly what they are getting.
The current solution to this issue is sufficiently implemented, tested, and documented for the master branch. For anyone just joining the thread, you might have a look at the new addendum to the reproducibility section of the README, or the analogous mentions in the drake.Rmd vignette and the quickstart vignette. There are obvious flaws in exclusively watching the version number of a package, but the other solutions are not much better. It is time to prioritize something straightforward. That way, users know exactly what they are getting and thus have complete control. The extra documentation and the additional functionality in …
Got bogged down with other work, but I think I might make working toward this a big part of my Hacktoberfest projects. My current line of thought is to create a new package that would be able to detect changes in other packages, probably by doing most of the heavy lifting once when the package loads, and then looking for incremental changes after that.
A nifty thing that I found regarding load_all() and customized functions in the workspace, though, is that direct comparison to the package function is faster than hashing. I need to check more to ensure that this holds for most functions, but with the ones that I've tried so far, it's been good.

```r
> library(tidyverse)
> library(microbenchmark)
> library(digest)
> stored_hash <- digest(read_csv)
> # The original snippet had cat(file), which references an undefined
> # variable; a literal message is presumably what was intended.
> read_csv <- function(...){cat("custom read_csv\n"); readr::read_csv(...)}
> times <- microbenchmark(
+   direct = identical(read_csv, readr::read_csv),
+   hashes = {new_hash <- digest(read_csv); new_hash == stored_hash}
+ )
> times %>% autoplot()
> times
Unit: microseconds
   expr    min      lq      mean  median      uq       max neval
 direct  6.565  9.2310  11.24539  11.488 12.7190    19.282   100
 hashes 43.898 47.3845 310.99931  67.282 68.5135 24725.754   100
```
Interesting. I wonder how long a function would have to be before hashing becomes the faster approach. The complete solution using package versions is already implemented and documented in development.
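A quick way to probe that crossover would be to benchmark generated functions of increasing body size, along these lines (a sketch; make_fun and the sizes are arbitrary):

```r
library(microbenchmark)
library(digest)

# Build structurally identical but distinct closures with n statements.
# keep.source = FALSE and a shared enclosure keep identical() honest.
make_fun <- function(n) {
  src <- paste0(
    "function(x) { ",
    paste0("x <- x + ", seq_len(n), "L", collapse = "; "),
    "; x }"
  )
  f <- eval(parse(text = src, keep.source = FALSE))
  environment(f) <- globalenv()
  f
}

f_small <- make_fun(10);   g_small <- make_fun(10)
f_big   <- make_fun(5000); g_big   <- make_fun(5000)
h_small <- digest(f_small)
h_big   <- digest(f_big)

microbenchmark(
  direct_small = identical(f_small, g_small),
  hash_small   = digest(g_small) == h_small,
  direct_big   = identical(f_big, g_big),
  hash_big     = digest(g_big) == h_big
)
```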
Another note: I just tried …
See #103. After thinking it over, I believe tracking packages is outside the scope of drake.
Sometimes I write custom one-off packages and develop them alongside the drake workflows I am working on. So maybe the code analysis should walk deeper into functions from packages. The current behavior is to walk all the way through the functions in the environment (to discover and track any functions nested in user-defined functions) but stop at functions from packages (including base).
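For context, a sketch of the kind of recursion involved, using codetools::findGlobals() to keep walking through closures defined in the user's environment while stopping at everything that resolves elsewhere (walk_deps is hypothetical, not drake's actual code):

```r
library(codetools)

# Collect the names of functions reachable from `fun`, recursing only into
# functions defined in `envir` (user code) and stopping at package functions.
walk_deps <- function(fun, envir, seen = character(0)) {
  globals <- findGlobals(fun, merge = FALSE)$functions
  for (name in setdiff(globals, seen)) {
    seen <- union(seen, name)
    if (exists(name, envir = envir, inherits = FALSE)) {
      obj <- get(name, envir = envir)
      if (is.function(obj)) {
        seen <- walk_deps(obj, envir, seen)  # user-defined: keep walking
      }
    }
    # otherwise the name resolves to a package (or base): stop here
  }
  seen
}
```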