Spread out cache values into different storr namespaces #129

Closed
10 tasks done
wlandau-lilly opened this issue Nov 1, 2017 · 3 comments

Comments

@wlandau-lilly
Collaborator

wlandau-lilly commented Nov 1, 2017

See #126 for background. Work will stay on the issue129 branch until enough progress has been made.

  • Implement the important namespaces in a new issue129 branch.
  • Implement a new user-side dependency_profile() function to help both users and developers figure out why a given target is listed as up to date or outdated.
  • Document the new dependency_profile() function (quickstart and caution vignettes, plus the list of useful functions).
  • Stop make() and explain if the cache has the old format. Help the user migrate to the new cache format.
  • Add a force argument to make() to force a make() even if it is not recommended, as with a cache in the wrong file format. An un-migrated project should build fine, but it will build from scratch.
  • Implement migrate() to transfer a project from the old cache format to the new one.
  • Implement back compatibility unit tests (i.e. tests of migrate()).
    • Test a fully up-to-date cache from the basic example with drake 4.1.0 and maintained in the repo as a zip file.
    • Test a half-outdated cache. (Maybe load_basic_example() and change reg2())
    • Be sure to test get_cache(), make(), outdated(), and clean() on each.
@wlandau-lilly
Collaborator Author

wlandau-lilly commented Nov 2, 2017

Current namespaces:

| namespace | description |
|---|---|
| build_times | Build times of each target and import. Includes user, elapsed, and system times. Each entry here is a data frame. |
| config | Each entry here is an element of the master internal configuration produced by config. |
| depends | The dependency hash: the hash of the collective storr hashes of all the dependencies of each target/import. |
| filemtime | The system modification times of files at the time they are built or imported. Useful for figuring out whether it is even worth the time to rehash a file (see #4). |
| functions | For imported functions, the actual raw value of the function read by readd(). In the objects namespace, the stored object for functions is the un-vectorized, deparsed function body as text. This makes sure the correct function is reproducibly tracked, and changes to whitespace and comments are ignored. |
| imported | A new namespace in 4.4.0 indicating whether each object was imported or built as a target. Will migrate to this instead of the $imported flag in the objects namespace. It is progress on #126. |
| objects | Default namespace. Contains the list object that is reproducibly tracked for each target: the actual value of the built target/import, plus metadata like object type and an "imported" flag. For functions, a collective hash of the dependencies is also stored so that imported objects nested within functions are reproducibly tracked. |
| progress | Stores the build progress of each target: "finished", "in progress", "failed". Unlisted targets were not attempted (yet). |
| session | Contains the sessionInfo() of the last call to make(). |
| target_attempts | Names of the targets marked to be built in the current make(). Used as a parallelism-agnostic mechanism for telling whether a target is up to date. |
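
As a minimal sketch of what a storr namespace is, assuming the storr package is installed: the same key can hold unrelated values in different namespaces. The key "my_target" and the stored values are made up for illustration and are not taken from drake's actual cache.

```r
library(storr)

# In-memory storr, so nothing is written to disk.
cache <- storr_environment()

# The same key holds a different value in each namespace.
cache$set("my_target", 1:10, namespace = "objects")
cache$set("my_target", "finished", namespace = "progress")

value <- cache$get("my_target", namespace = "objects")     # the stored value
status <- cache$get("my_target", namespace = "progress")   # the build status
namespaces <- cache$list_namespaces()                       # namespaces in use
```

Splitting metadata into its own namespaces means a status lookup like the one above never has to load the (possibly large) target value.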

Planned namespaces:

| namespace | description |
|---|---|
| build_times | Same as before. |
| config | Same as before. |
| commands | The workflow plan data frame command that built the target, if applicable. Storing this separately from depends will give us part of #131. |
| depends | The long hash of the output from lightly_parallelize(X = the_appropriate_dependencies, FUN = cache$get_hash) %>% unlist %>% unname. Major difference from before: the workflow plan command is not factored into the computation. |
| depends_debug | Named vector of hashes from get_cache()$get_hash(..., namespace = 'reproducibly_tracked'). Stripping the names away and hashing should evaluate to the hash in depends. A new debug argument to make() will trigger the use of the depends_debug namespace. It is expensive in time and storage, but essential for debugging how things are reproducibly tracked. |
| file_modification_times | Migrate from filemtime. |
| imported | Same as above. |
| progress | Same as before. |
| readd | The value that should be read from the cache on a call to readd(target). This namespace will generally share values with reproducibly_tracked via richfitz/storr#56 to avoid duplication of data in the cache. But for imported functions, the value stored will be the de-vectorized/deparsed/tidied function body text and the dependency hash. |
| reproducibly_tracked | Object to be reproducibly tracked. Changes to the data here should trigger downstream (re)builds. |
| session | Same as before. |
| target_attempts | Same as before. |
| type | From the $type field of the old objects namespace: indicator of whether the target/import is a function, file, or generic object. |
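
A rough sketch of how the planned depends and depends_debug namespaces could relate, assuming the storr and digest packages are installed. The keys, the stored values, and the use of digest() as the final hash are assumptions for illustration, not necessarily what drake does internally.

```r
library(storr)
library(digest)

cache <- storr_environment()

# Hypothetical dependencies of a target, stored in the planned namespace.
cache$set("x", 1:3, namespace = "reproducibly_tracked")
cache$set("f", "function(a) a + 1", namespace = "reproducibly_tracked")
deps <- c("x", "f")

# Named vector of per-dependency hashes (what depends_debug would hold).
dep_hashes <- vapply(
  deps,
  function(key) cache$get_hash(key, namespace = "reproducibly_tracked"),
  character(1)
)

# Strip the names and hash the result (what depends would hold).
# digest() is a stand-in for whatever hash function is actually used.
depends_hash <- digest(unname(dep_hashes))

cache$set("my_target", dep_hashes, namespace = "depends_debug")
cache$set("my_target", depends_hash, namespace = "depends")
```

With both namespaces populated, a mismatch in depends can be traced to the individual dependency whose hash changed by comparing the named vectors in depends_debug.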

Migration of namespaces via migrate():

| to | from | how |
|---|---|---|
| commands | config | Copy over the stored workflow plan command for each target. |
| depends | objects | This part is the trickiest and most sensitive. 1. Use the dependency_hash() function and supporting functions of an earlier drake (4.3.0) to figure out which targets are outdated. Be sure to quarantine the computation in a fresh environment. 2. Walk through the workflow graph and compute brand new depends hashes for everything. 3. For all originally outdated targets, mangle the new depends hashes. |
| file_modification_times | filemtime | Simple copies. |
| imported | objects | Copy over the $imported list element. |
| readd | functions | Simple copies. |
| readd | objects | For non-functions only, copy over the $value list element. |
| reproducibly_tracked | objects | Copy over the $value list element. For functions, this is a simple copy. For non-functions, only the hash should be transferred via richfitz/storr#56 to avoid duplication of data. |
| type | objects | Copy over the $type list element. |
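
The "simple copies" rows could look roughly like the sketch below, assuming the storr package is installed. copy_namespace() is a hypothetical helper written for this example, not a drake or storr function, and the file name and timestamp are made up.

```r
library(storr)

# Hypothetical helper: copy every key from one namespace to another.
copy_namespace <- function(cache, from, to) {
  for (key in cache$list(namespace = from)) {
    cache$set(key, cache$get(key, namespace = from), namespace = to)
  }
  invisible(cache)
}

cache <- storr_environment()
cache$set("report.md", 1509494400, namespace = "filemtime")

copy_namespace(cache, from = "filemtime", to = "file_modification_times")
copied <- cache$get("report.md", namespace = "file_modification_times")
```

Because storr stores values by content hash, the key in the new namespace ends up with the same hash as the key in the old one.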

@wlandau-lilly wlandau-lilly changed the title Spread out cache values into different storr namespaces Spread out cache values into different storr namespaces Nov 3, 2017
@wlandau-lilly
Collaborator Author

migrate():

  • Use old code to identify the outdated targets.
  • Lightly parallelize: for each target/import:
    • readd() it to get the actual value.
    • Call store_target() to automatically put it in the right place in the cache. Use a dummy hash_list for this stage.
  • Get the hash_list() of everything and store the depends hashes that mark everything as up to date.
  • For the outdated targets, mangle the depends hashes.
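
The last two steps above can be sketched as follows, assuming the storr and digest packages are installed. The target names, the outdated_before vector, the placeholder hashes, and the "_mangled" suffix are all assumptions for illustration; the real migrate() may mangle hashes differently.

```r
library(storr)
library(digest)

cache <- storr_environment()
targets <- c("a", "b", "c")
outdated_before <- c("b")  # assumed output of the old outdated-detection code

# Stand-in for storing depends hashes that mark everything as up to date.
for (target in targets) {
  cache$set(target, digest(target), namespace = "depends")
}

# Mangle the hashes of the originally outdated targets so they can no longer
# match a freshly computed dependency hash.
for (target in outdated_before) {
  old <- cache$get(target, namespace = "depends")
  cache$set(target, paste0(old, "_mangled"), namespace = "depends")
}
```

Any change that guarantees a mismatch would do here; the point is that targets that were outdated before migration stay outdated afterward.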

This seems straightforward and testable.

@wlandau-lilly
Collaborator Author

I decided to fix and merge #129. It is the right decision in the long term. Plus, migrate() seems to be working well. I unit-tested it and tried it out on a couple of large real projects.

The namespaces are a bit different than in earlier comments.
