
External packages as dependencies #6

Closed

wlandau-lilly opened this issue Mar 4, 2017 · 49 comments

Comments

@wlandau-lilly
Collaborator

wlandau-lilly commented Mar 4, 2017

Sometimes I write custom one-off packages and develop them alongside the drake workflows I am working on. So maybe the code analysis should walk deeper into functions from packages. The current behavior is to walk all the way through the functions in the environment (to discover and track any functions nested in user-defined functions) but stop at functions from packages (including base).
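For instance, here is a hedged sketch of that behavior with hypothetical user code:

# f() and g() are user-defined, so the code analysis walks through both
# and discovers that f() depends on g().
g <- function(x) stats::median(x)
f <- function(x) g(x) + 1
# The analysis stops at stats::median(): package functions are not
# tracked at all, which is the gap this issue is about.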

@AlexAxthelm
Collaborator

AlexAxthelm commented Sep 7, 2017

Would it make sense to have a "package import" list as part of the import step? That is, a single target covering all the internal package functions, in addition to the functions that are actually called in plan$command?

@wlandau-lilly
Collaborator Author

wlandau-lilly commented Sep 7, 2017

I tried the latter at one point, but I decided to keep imports totally separate since v2.0.0. I was thinking of maybe

  1. a recursive version of import_dependencies() that dives into non-base functions, and/or
  2. appending all non-base package functions to the candidate imports and letting build_graph() figure out the rest.

I am on the fence about whether to track base, and to be honest, whether to dive into packages at all.

@AlexAxthelm
Collaborator

It might be simplest to have the package itself be the import target. This would err on the side of rebuilding too much, though, since an update to a package might not touch the few functions I actually care about.

@wlandau-lilly
Collaborator Author

wlandau-lilly commented Sep 7, 2017

That does seem like the most practical option because it avoids the problem of having a massive import list. We could

  1. Look at all the imports.
  2. Look at where each imported object/function is defined (e.g., environment(knit); see the sketch below).
  3. If the import is from a package, list that whole package as a dependency.

import_dependencies() could take care of this (making sure to check both imports and the dependencies of those imports), but other things need to be adjusted, including graphing and build.R. All package targets should probably be named something like package:xyz, and we will probably need an import_package() function in build.R.
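A minimal sketch of step 2, assuming base R's environment() and getNamespaceName() suffice (package_of() is a hypothetical helper name):

# Map an imported function to the package that defines it.
# Returns NA for objects not defined in a package namespace.
package_of <- function(fun) {
  env <- environment(fun)
  if (!is.null(env) && isNamespace(env)) {
    paste0("package:", getNamespaceName(env))
  } else {
    NA_character_
  }
}
package_of(knitr::knit) # "package:knitr"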

Concerns:

  1. Back-compatibility will be affected. Old workflows would need to rerun from scratch. (Would we need to increase the major version to 5? I would rather not have to do that.)
  2. Hashing packages could be time-consuming. Not all packages respect the 5MB limit. (This is minor.)

@wlandau-lilly changed the title from "Track functions from packages" to "Track packages as dependencies" Sep 7, 2017
@wlandau-lilly changed the title from "Track packages as dependencies" to "Track dependencies from external packages" Sep 7, 2017
@AlexAxthelm
Collaborator

The cheap way to get around [2] would be to track the version number (for CRAN) or the commit hash (GitHub). That would offload the hashing work to the maintainers rather than the users. At that point, the main danger would be a user happily hacking away at their local version of a package and not making a commit before make().

I feel this is a situation which, at least in the short term, could be dealt with using a section in the caution vignette.
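A hedged sketch of that idea (pkg_fingerprint() is a hypothetical name; RemoteSha is the field devtools-style installs record in DESCRIPTION):

# Fingerprint a package cheaply: prefer the recorded commit SHA,
# fall back to the version number from DESCRIPTION.
pkg_fingerprint <- function(pkg) {
  desc <- utils::packageDescription(pkg)
  if (!is.null(desc$RemoteSha)) {
    desc$RemoteSha             # e.g. a GitHub commit hash
  } else {
    as.character(desc$Version) # e.g. a CRAN version number
  }
}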

@wlandau-lilly
Collaborator Author

wlandau-lilly commented Sep 7, 2017

That makes sense, and I believe this is what packrat does. For now, we could just assume local packages do not have a CRAN/Bioconductor presence or a GitHub commit hash, and as you said, put the obvious pitfalls in the caution vignette.

I should say that I am waiting to develop on this and other issues to see if my company's open source procedure changes soon. We have a solid draft of the new procedure, and it is en route to final approval. I will meet this afternoon with a gatekeeper from "Quality", who will hopefully give me a better idea of when I can just push to drake without bureaucratic delays.

@AlexAxthelm
Collaborator

That is exciting! I've had a few other projects pop up lately, but I'm hoping to put away a few files on #40 today, since that seems like the most reasonable step at this point. I think that #77 / #79 is complex enough that having clear direction on push/pull requirements will help a lot.

@wlandau-lilly
Collaborator Author

It is exciting! The gatekeeper I met today was extremely supportive and willing to work fast. There is barely any editing left. On the other hand, many people have to see it and sign off, so I do not know exactly what to expect.

I look forward to seeing what you have done for #40. I think that will also help to free up more development without fear of merge conflicts.

@wlandau-lilly
Collaborator Author

Regarding the original issue, tracking whole packages strikes me as a straightforward way to reproducibly track compiled C/C++/Fortran code (for use in .Call(), etc.). I do not know how else we could do this easily.

@wlandau-lilly
Collaborator Author

wlandau-lilly commented Sep 10, 2017

I thought more about this, and I am leaning more and more toward tracking the versions/hashes of CRAN/Bioconductor/GitHub/Bitbucket packages. However, we should suppress this feature for versions 4.1.0 and prior. Otherwise, a drake update will trigger rebuilds in everyone's workflows. We're already in major version 4, and I need to be better about respecting back-compatibility.

For custom local packages, things get a little trickier. I think we should stick to a simple, straightforward fix that is unlikely to surprise users. Maybe that means we do not track them at all, which would make sense because local packages are an edge case anyway. The user could still recursively walk through package functions by appending to the execution environment for make().

# Append the functions from custom local packages (hypothetical names)
# to the execution environment so drake's code analysis can see them.
envir <- as.environment(
  c(
    as.list(envir),
    as.list(getNamespace("custom_package_1")),
    as.list(getNamespace("custom_package_2"))
  )
)

This hack belongs in the caution vignette anyway.

Also, we will need to update tracked() and deps() to include all the tracked packages.

@AlexAxthelm
Collaborator

I wonder if it might be worth it to add a drake_version field to config, so that we can make drake aware of situations like this?

More relevant, though: it might be best to wrap this up in plan() rather than make(), so that previously created plans run without issue, but any new plans generated going forward would have packages included. That would also make the expected package version explicit in the plan, rather than hiding in environment data.

   target                              command
1   dplyr   packageVersion('dplyr') == '0.7.2'
2 ggplot2 packageVersion('ggplot2') == '2.2.1'

I'm still trying to figure out how to make the import have a dependency on the package version though. It would make sense to only re-run objects that have been affected by updates.
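If plan() went this route, generating those rows could look something like this sketch (hypothetical, not an existing plan() feature):

# Build workflow plan rows that pin the currently installed versions.
pkgs <- c("dplyr", "ggplot2")
versions <- vapply(pkgs, function(p) as.character(packageVersion(p)), character(1))
data.frame(
  target = pkgs,
  command = sprintf("packageVersion('%s') == '%s'", pkgs, versions),
  stringsAsFactors = FALSE
)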

@AlexAxthelm
Collaborator

Not unrelated: Should we also track R_version in the same way?

@wlandau-lilly
Collaborator Author

wlandau-lilly commented Sep 11, 2017

Thankfully, session() already covers the versions of drake and R.

load_basic_example()
make(my_plan, verbose = FALSE)
s <- session()
s$otherPkgs$drake$Version
## "4.1.1.9000"
s$R.version$major
## "3"
s$R.version$minor
## "4.0"

It is a good thought to consider package dependencies via plan(). However, I foresee a few issues.

  1. Currently, imports cannot depend on targets in the workflow plan.
  2. Users might create workflow plan data frames with functions other than plan().
  3. Not all package dependencies can be detected this way unless all the imports are also listed as targets in the plan.

I do like your thinking on this, though. Ultimately, the fix should look this clean and elegant.

@wlandau-lilly
Collaborator Author

wlandau-lilly commented Sep 12, 2017

Here are some of the changes I am thinking about.

deps()

Current behavior:

head(deps(lm))
## [1] ".getXlevels" "as.vector"   "attr"        "c"           "eval"       
## [6] "gettextf"  

This is already wrong because drake does not do code analysis on lm(). Desired behavior:

deps(lm)
## "package:stats"
deps(knitr::knit)
## "package:knitr"

No back compatibility concerns here.

New function is_package()

  • Just checks for the pattern "package:.*". To be used in deps(), etc. (see the sketch after this list).
  • Should be vectorized.
  • sanitize_plan() and sanitize_target() should check is_package(targets) and throw an informative error if any target names look like package names. I do not think we need to make this part back compatible since few users are likely to have target names like this anyway.
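A minimal vectorized sketch of the proposed helper:

# Vectorized check for names that look like package references.
is_package <- function(x) {
  grepl("^package:", x)
}
is_package(c("package:stats", "my_target")) # TRUE FALSE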

build_graph()

  • Ignore the following for drake <= 4.1.0. This bulleted list is the only place we should have to control for back compatibility.
  • Detect package dependencies from dependency_list. Check the elements and the names, but exclude targets.
  • Extend dependency_list to account for imports from packages.
  • Proceed with keys, vertices, etc.

build.R

  • import_target() should call a new function package_value() if is_package(target) is TRUE.
  • package_value() should return a version number, git SHA, or other fingerprint.
  • There should be a new store_package() function, analogous to store_target():

store_package <- function(target, value) {
  config$cache$set(
    key = target,
    value = list(type = "package", value = value)
  )
}

plot_graph(), etc.

  • Add a new diamond shape for packages in both the main graph and legend_nodes.
  • The node id should begin with the "package:" prefix, but the label text should not.
  • Show the version number or git hash in the hover text.

Testing: test-packages.R

It would be impractical to install packages inside a unit test. I think a reasonable workaround is to

  • create a special small workflow,
  • hack into the storr cache and plant the wrong package version number,
  • call build() manually on each target,
  • verify that the last targets built correctly,
  • call make(),
  • verify that make() corrected the package version and built the targets that (indirectly) depended on it, and
  • call make() again to verify that everything is up to date.

Testing for back compatibility

I am not sure if we can, or even should, build this into a unit test. However, it is worth building a project with functions from packages under drake 4.1.0, then checking that everything is still up to date under the later drake.

Documentation

The caution vignette should have a note about how packages are tracked and the dangers of custom local packages installed from source.

By the way, for packages installed from source, I think version numbers should be fine. Such packages could be personal projects by the user, or they could just be uncooperative external non-CRAN packages that need to be installed this way. Most installations from source are probably the latter.

For packages loaded with devtools::load_all(), I do not know what will happen. In fact, the caution vignette should advise against loading packages this way for make().

Lingering thoughts

Am I missing anything? I think this is simple enough to implement, and I think it is the right way to go. It did just occur to me that users who build custom workflows as packages will have an even harder time, but the change is still worth it.

@AlexAxthelm
Collaborator

Did a little digging this morning, and we might want to make use of devtools:::local_sha() (not exported).

I don't have any Bioconductor or Bitbucket repos installed, but it has comparable performance for packages installed from CRAN or GitHub.

Mostly making a note here, before I start work on other stuff, but later this week, I hope to look at how to integrate this into function_dependencies().

> microbenchmark::microbenchmark(CRAN1 = devtools:::local_sha("ggplot2"), CRAN2 = devtools:::local_sha("drake"), GH1 = devtools:::local_sha("lintr"), GH2 = devtools:::local_sha("hrbrthemes"))
Unit: milliseconds
  expr   min     lq    mean median     uq   max neval
 CRAN1 2.662 2.7565 2.93691 2.8550 3.0195 3.563   100
 CRAN2 2.606 2.7025 2.87879 2.7925 2.9690 4.465   100
   GH1 2.562 2.6680 2.86591 2.7555 2.9815 3.673   100
   GH2 2.607 2.7170 3.00867 2.8620 3.0770 5.770   100
> list(CRAN1 = devtools:::local_sha("ggplot2"), CRAN2 = devtools:::local_sha("drake"), GH1 = devtools:::local_sha("lintr"), GH2 = devtools:::local_sha("hrbrthemes"))
$CRAN1
[1] "2.2.1"

$CRAN2
[1] "4.1.0"

$GH1
[1] "15fd8ee6685248b7f477c949c1e78dc13350cd15"

$GH2
[1] "1fd0301ce07f3e025f1a259008ea74f251f9d48b"

@AlexAxthelm
Collaborator

Also a note: getNamespaceName(environment(dplyr::lag)) works where find("dplyr::lag") does not. find() requires the namespace to have been explicitly loaded beforehand, and it does not understand the :: qualifier. find() returns all the packages on the search path that contain a given function name, whereas the getNamespaceName() approach shows the namespace where the function is actually defined.

> getNamespaceName(environment(dplyr::lag))
   name
"dplyr"
> getNamespaceName(environment(stats::lag))
   name
"stats"
> getNamespaceName(environment(base::lag)) #Does not exist
Error in get(name, envir = ns, inherits = FALSE) : object 'lag' not found
> getNamespaceName(environment(lag)) #Default Search Path
   name
"stats"
> getNamespaceName(environment(mutate)) # Not on Search Path Yet.
Error in environment(mutate) : object 'mutate' not found

> find("lag")
[1] "package:stats"
> find("mutate")
character(0)
> find("dplyr::lag")
character(0)
> find("dplyr::mutate")
character(0)
> suppressPackageStartupMessages(library("dplyr"))
> getNamespaceName(environment(dplyr::lag))
   name
"dplyr"
> getNamespaceName(environment(stats::lag))
   name
"stats"
> getNamespaceName(environment(lag)) #Default Search Path
   name
"dplyr"
> find("lag")
[1] "package:dplyr" "package:stats"
> find("mutate")
[1] "package:dplyr"
> find("dplyr::lag")
character(0)
> find("dplyr::mutate")
character(0)
>

@wlandau-lilly
Collaborator Author

I just submitted an SO post in hopes of finding an already-exported version of devtools:::local_sha(). Good call on getNamespaceName() vs find(). We may want to wrap getNamespaceName() in a tryCatch() statement; I think it should return empty-handed instead of throwing an error.
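Something like this sketch, perhaps (safe_namespace_name() is a hypothetical name; lazy evaluation means the error from a missing object is raised, and caught, inside the tryCatch()):

# Return character(0) instead of an error when the object does not
# exist or is not defined in a namespace.
safe_namespace_name <- function(fun) {
  tryCatch(
    as.character(getNamespaceName(environment(fun))),
    error = function(e) character(0)
  )
}
safe_namespace_name(dplyr::lag) # "dplyr"
safe_namespace_name(mutate)     # character(0) if dplyr is not attached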

@AlexAxthelm
Collaborator

Looks good. If there aren't any bites on that, the answers to this SO question are informative. The main options are:

  1. copy the code itself directly into drake and provide attribution, or
  2. use fun <- getFromNamespace("fun", "pkg") to get around CRAN's restriction on the ::: operator.

Neither would add anything to drake's dependencies that isn't actually there, but [2] would mean that a change to devtools:::local_sha() might change the local hashes that drake uses. I don't know how you feel about maintaining a drake copy of that function.
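For reference, option 2 would be a one-liner:

# Pull the unexported helper out of devtools without using ':::'.
local_sha <- utils::getFromNamespace("local_sha", "devtools")
local_sha("drake") # a version number or commit SHA, depending on the install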

@wlandau-lilly
Collaborator Author

I vote for (1) if it comes to that, but it will be difficult to extricate local_sha() from the rest of the devtools internals.

@wlandau-lilly changed the title from "Track dependencies from external packages" to "External packages as dependencies" Sep 13, 2017
@AlexAxthelm
Collaborator

Agreed.
If we do end up going route 1, I wonder if it would be preferable to make it its own package, rather than a part of drake.

@wlandau-lilly removed this from the v4.2.0 CRAN release milestone Sep 29, 2017
wlandau-lilly added a commit that referenced this issue Oct 1, 2017
Needs fixing, testing and back compatibility
@wlandau-lilly
Collaborator Author

I did some more work on this, and @jimhester's eapply() one-liner is turning out to be the most practical way of doing things. As much as I would like to hash compiled code, we only reliably have the shared library in the package's file system, and I think it would be unwise to hash that. And hashing a package's datasets would be time-consuming and usually unnecessary. Let's stick to just the functions. I no longer think we need a separate tool.

I also noticed that the base package throws a warning when I try to hash it, so for that one, I am hashing R.version instead.

At this point, I need to fix some difficult bugs, enforce back compatibility, and see how much useful testing I can add.

wlandau-lilly added a commit that referenced this issue Oct 1, 2017
Changed the back-compatibility test to use
eply::strings. That way, I can be sure that "package:eply"
is not an import/dependency node for projects created with
early versions of drake.
@wlandau-lilly
Collaborator Author

wlandau-lilly commented Oct 1, 2017

Correction: I am using R.version$version.string, not R.version (and for all of the base packages, not just "base"). I do not want projects to rerun just because the machine changes. Even depending on the R version is a bit much.

Also, it may not be enough to remove the srcrefs. When I try installing the same package to two local libraries, I am getting different hashes (see the MWE below). For now, I have decided to deparse the functions instead, which is what drake does with the user's functions anyway. Unfortunately, deparsing is slower, so I would rather find some way to clean and serialize loaded code.

withr::with_dir(new = tempdir(), code = {

  # Remove srcrefs from all the functions,
  # then hash them all together.
  hash_without_srcref <- function(pkg){
    digest::digest(
      eapply(
        asNamespace(pkg),
        function(x) {
          attr(x, "srcref") <- NULL
          x
        },
        all.names = TRUE
      )
    )
  }

  # Install a dummy package in a local library.
  lib <- "local_lib"
  dir.create(lib)
  pkgenv <- new.env()
  pkgenv$newfunction <- function(x){
    x + 1
  }
  withr::with_message_sink(
    new = tempfile(),
    code = {
      package.skeleton(name = "newpkg", environment = pkgenv)
    }
  )
  unlink(file.path("newpkg", "man"), recursive = TRUE)
  install.packages("newpkg", type = "source", repos = NULL,
    lib = lib, quiet = TRUE)

  # Install the same package in a different local library.
  lib2 <- "local_lib2"
  dir.create(lib2)
  install.packages("newpkg", type = "source", repos = NULL,
    lib = lib2, quiet = TRUE)

  # Compute the hash of newpkg in each local library.
  withr::with_libpaths(
    new = c(lib, .libPaths()),
    code = {
      unloadNamespace("newpkg")
      hash1 <- hash_without_srcref("newpkg")
    }
  )
  withr::with_libpaths(
    new = c(lib2, .libPaths()),
    code = {
      unloadNamespace("newpkg")
      hash2 <- hash_without_srcref("newpkg")
    }
  )

  # I want them to be equal, but they are different.
  testthat::expect_equal(hash1, hash2)
})
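Putting the two rules together, the fingerprinting described above amounts to something like this sketch (fingerprint() is a hypothetical name; the deparse-based hash matches the pkg_hash() in the next comment):

# Base packages are fingerprinted by the R version string; other
# packages are hashed via their deparsed functions.
fingerprint <- function(pkg) {
  base_pkgs <- rownames(utils::installed.packages(priority = "base"))
  if (pkg %in% base_pkgs) {
    R.version$version.string
  } else {
    digest::digest(eapply(asNamespace(pkg), FUN = deparse, all.names = TRUE))
  }
}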

@wlandau-lilly
Collaborator Author

Another concern about this whole issue: even if I deparse all the functions before hashing, some package hashes turn out different on different platforms. Example: knitr. On the issue6 branch of drake:

pkg_hash <- function(pkg){
  digest::digest(eapply(asNamespace(pkg),
    FUN = deparse, all.names = TRUE))
}

On Windows R-3.4.0:

packageVersion("knitr")
## [1] ‘1.17’
pkg_hash("knitr")
## [1] "70d9000607afbf3b1bf87651d73ce55d"

On Red Hat Linux R-3.4.0:

packageVersion("knitr")
## [1] ‘1.17’
pkg_hash("knitr")
## [1] "ae52c9275b69186123ef062f98d42b15"

In my opinion, these hashes overreact to differences in installed instances of packages. I think we will have to go with the most transparent thing: just depend on package versions. It's not ideal, but the users will know exactly what they are getting.

@wlandau-lilly
Collaborator Author

wlandau-lilly commented Oct 1, 2017

The current solution to this issue is sufficiently implemented, tested, and documented for the master branch. For anyone just joining the thread, you might have a look at the new addendum to the reproducibility section of the README, or the analogous mentions in the drake.Rmd vignette or quickstart vignette.

There are obvious flaws in exclusively watching the version number of a package, but the other solutions are not much better. It is time to prioritize something straightforward. That way, users know exactly what they are getting and thus have complete control. The extra documentation and the additional functionality in plot_graph() should help. Now, when you hover over a package node, it shows the version number for a regular package and the R version string for a base package, which is exactly where drake watches for changes.

@AlexAxthelm
Collaborator

Got bogged down with other work, but I think I might make working towards this a big part of my Hacktoberfest projects. My current line of thought is to create a new package that would be able to detect changes in other packages, probably by doing most of the heavy lifting once when the package loads, and then looking for incremental changes after that.

@AlexAxthelm
Collaborator

AlexAxthelm commented Oct 3, 2017

A nifty thing that I found re: load_all() and customized functions in the workspace, though, is that direct comparison to the package function is faster than hashing. I need to check more to ensure that this holds for most functions, but it has been good with the ones I've tried so far.

> library(tidyverse)
> library(microbenchmark)
> library(digest)
> stored_hash <- digest(read_csv)
> read_csv <- function(...){cat("custom\n"); readr::read_csv(...)}
> times <- microbenchmark(
+ direct = identical(read_csv, readr::read_csv),
+ hashes = {new_hash <- digest(read_csv); new_hash == stored_hash}
+ )
> times %>% autoplot()
> times
Unit: microseconds
   expr    min      lq      mean median      uq       max neval
 direct  6.565  9.2310  11.24539 11.488 12.7190    19.282   100
 hashes 43.898 47.3845 310.99931 67.282 68.5135 24725.754   100

@wlandau-lilly
Collaborator Author

Interesting. I wonder how long a function would have to be before hashing becomes the faster approach.

The complete solution using package versions is already implemented and documented in development drake, but I am willing to change direction if you can solve this problem. I could not myself, even though I tried to remove srcrefs. (In fact, the srcrefs were not even there to begin with.)

@wlandau-lilly
Collaborator Author

Another note: I just tried utils::removeSource(), but with the same failure. See this SO post.

@wlandau-lilly
Collaborator Author

See #103. After thinking it over, I believe tracking packages is outside the scope of drake, and I have already reverted my work on this. Packrat is a much better solution. It extends shelf lives and preserves projects. Drake, on the other hand, would shorten shelf lives and break projects if it tried to accomplish its own version of the same thing. Packrat is the ounce of prevention to drake's pound of cure. I explicitly recommend packrat and Docker in the README and vignettes.

3 participants