
Corrupted storr when killing making process #126

Closed
violetcereza opened this issue Nov 1, 2017 · 17 comments

@violetcereza
Contributor

I'm aware that this may be beyond the scope of drake, but occasionally I get this weird non-reproducible bug. In the process of working with my data, I sometimes mistakenly start making a target that I know will fail or be incorrect. This freezes up my R session while I wait for it to resolve. Instead of waiting, I kill the RStudio process, or the Makefile process from htop.

This usually works fine -- in fact, using Makefile parallelism is great because you can see which targets are associated with which processes directly from htop and when you kill a process you don't have to relaunch your RStudio session and lose everything in your global environment.

However, once in a blue moon, killing a process seems to have consequences. You get these targets that seem to be stuck in a half-saved state, and that cause drake functions to fail with the error:

Error in readRDS(self$name_hash(hash)) : error reading from connection

There seems to be no way to recover from this, other than deleting the whole .drake directory and starting from scratch. Even if you know which target was corrupted, clean(target) fails with the same error. I'm having some success with get_cache()$del(target), which makes some things work again but still seems to leave the cache in a semi-corrupted state.
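
A minimal sketch of that workaround, assuming drake's get_cache() returns the project's storr cache ("target" here is a placeholder name):

```r
library(drake)

# Load the storr cache of the current project (the .drake/ directory).
cache <- get_cache()

# clean("target") reads the stored value first and so fails on a
# corrupted key. Deleting the raw key directly skips that read,
# but also skips drake's bookkeeping in other namespaces, which is
# why the cache can be left in a semi-corrupted state.
cache$del("target")
```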

I guess I'm just posting to see if anyone else has had similar experiences / how they fixed them without clearing the whole cache.

@wlandau-lilly
Collaborator

wlandau-lilly commented Nov 1, 2017

@dapperjapper I am really glad you brought this up. I think the top priority is to improve storr, maybe adding something like this:

library(storr)
x <- storr_rds("my_cache")
x$set(key = "a", value = "b")
# break 'a'
x$repair() # remove dangling files for half-saved targets.

Would you also post this on the storr issue tracker? The solution will ultimately come from there. Still, I will keep this drake thread open for a while.

After a failed clean(target), do you get a traceback() that dives into the call stack of storr functions?

I am also wondering how to improve job monitoring. For example, what would it take to label and monitor jobs by target name for all kinds of parallelism, not just Makefile parallelism? For "future_lapply" parallelism, there is probably some way that involves the configuration *.tmpl file, but I would also like something similar for other forms of parallelism. Drake has in_progress() and progress(), but that does not provide a way to kill jobs. processx has easier job monitoring, but the scope looks limited to processes launched with processx itself.

@violetcereza
Contributor Author

I will post on the storr issue tracker! Thanks for the quick response!

I don't know a lot about processes/*nix systems in general, so I don't think I can be terribly helpful there. One thing I have been wondering about is make()ing in the background without blocking the R session. It seems like knitting Rmds is able to run without blocking, and I'm wondering if drake could ever get a spot like that. Does the knitting process have its own separate R session that is maintained by RStudio? Can drake get one?

Here is a traceback:

> clean(premiers_data_raw_sept)

Error in readRDS(self$name_hash(hash)) : error reading from connection

> traceback()

13: readRDS(self$name_hash(hash))
12: self$driver$get_object(hash)
11: self$get_value(self$get_hash(key, namespace), use_cache)
10: cache$get(target)
9: (function (target, cache) 
   {
       if (!(target %in% cache$list())) 
           return(FALSE)
       cache$get(target)$imported
   })(target = dots[[1L]][[1L]], cache = <environment>)
8: mapply(FUN = function (target, cache) 
   {
       if (!(target %in% cache$list())) 
           return(FALSE)
       cache$get(target)$imported
   }, target = "premiers_data_raw_sept", MoreArgs = list(cache = <environment>), 
       SIMPLIFY = TRUE, USE.NAMES = TRUE)
7: do.call("mapply", c(FUN = FUN, args[dovec], MoreArgs = list(args[!dovec]), 
       SIMPLIFY = SIMPLIFY, USE.NAMES = USE.NAMES))
6: is_imported(target = target, cache = cache)
5: (function (target, cache) 
   {
       if (is.null(cache)) {
           return()
       }
       if (is_file(target) & !is_imported(target = target, cache = cache)) {
           unquote(target) %>% unlink(recursive = TRUE, force = TRUE)
       }
       default <- cache$default_namespace
       for (space in c(default, "depends", "filemtime", "functions")) if (target %in% 
           cache$list(namespace = space)) {
           cache$del(target, namespace = space)
       }
       invisible()
   })(target = dots[[1L]][[1L]], cache = <environment>)
4: mapply(FUN = function (target, cache) 
   {
       if (is.null(cache)) {
           return()
       }
       if (is_file(target) & !is_imported(target = target, cache = cache)) {
           unquote(target) %>% unlink(recursive = TRUE, force = TRUE)
       }
       default <- cache$default_namespace
       for (space in c(default, "depends", "filemtime", "functions")) if (target %in% 
           cache$list(namespace = space)) {
           cache$del(target, namespace = space)
       }
       invisible()
   }, target = "premiers_data_raw_sept", MoreArgs = list(cache = <environment>), 
       SIMPLIFY = TRUE, USE.NAMES = TRUE)
3: do.call("mapply", c(FUN = FUN, args[dovec], MoreArgs = list(args[!dovec]), 
       SIMPLIFY = SIMPLIFY, USE.NAMES = USE.NAMES))
2: uncache(targets, cache = cache)
1: clean(premiers_data_raw_sept)

@violetcereza
Contributor Author

Semi-related Q: why are you storing object metadata like $type and $imported inside the objects namespace alongside the object itself, instead of peeling them off to a separate storr namespace?

@wlandau-lilly
Collaborator

For $type and $imported, that was a poor design choice. Basically, $type and $imported detect imported functions. If a function is imported from the user's environment, I want it to depend on any nested imports in the body. For the basic example, storr_rds(".drake")$get("reg1") also has a $depends element that depends on nested objects. Essentially, I put in the default namespace everything that I want to be reproducibly tracked. In hindsight, $type and $imported should not be there. If the value doesn't change, it should not matter if you change a target into an import in your workflow plan.
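
For illustration, a sketch of the two layouts using plain storr calls; the "imported" namespace name here is an assumption, not drake's actual schema:

```r
library(storr)
st <- storr_rds(".drake")

# Current layout: the metadata rides along with the value, so reading
# the flag means deserializing the whole stored object:
#   st$get("reg1")$imported

# Separate-namespace layout: the flag is a tiny key of its own, so it
# can be read without touching the (possibly large) value.
st$set("reg1", FALSE, namespace = "imported")
st$get("reg1", namespace = "imported")
```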

Fixing that will be ambitious, especially because of back compatibility concerns. A separate preprocessing step will be needed in order to migrate data to the appropriate storr namespaces before a make(). But I think it can be done.

Here, I see that future can casually run stuff in the background. I will probably try this on my own projects, and maybe include it in the documentation, but probably not work it into drake's code base. It certainly is not compatible with drake's future_lapply() backend. processx and svSocket may be worth considering too.
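
A sketch of that background idea with future, outside drake's own API; the plan choice and my_plan are assumptions:

```r
library(future)
plan(multisession)  # evaluate in a separate background R session

# Launch make() without blocking the interactive session.
f <- future(drake::make(my_plan))

# Keep working in the console; poll and collect the result later.
resolved(f)
value(f)
```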

@wlandau-lilly
Collaborator

wlandau-lilly commented Nov 1, 2017

If we are going to refactor the most sensitive internals, we are most likely not going to have back compatibility. So I want to think long and hard about it, plan and discuss with you guys, and get your support. I probably will not work on it until well after the next CRAN release at least. When I do work on it, I want to share formal design specs and talk about them with the community.

I learned a TON over the course of drake's development, and there is a lot I would do differently, especially more elegantly, if I went back and did it again from scratch. But for sensitive changes to hashing and caching, we need patience and careful planning.

@wlandau-lilly
Collaborator

To be clear, I have not made the decision to refactor anything that deep. I think there needs to be a stronger reason than just the elegance of the code. Back compatibility is more important than the slight oversensitivity due to tracking $type and $imported.

@violetcereza
Contributor Author

Definitely -- I agree that it is a big decision. The reason I ask is that the traceback seems to include is_imported(), which seems to be loading the corrupted file before clean()ing it. It seems like overkill to have to load the whole built object just to determine if it is imported or not, and additionally it breaks clean() if the given object is corrupted.

Is this also the reason why plot_graph() is rather slow for me, when working on a project with large targets? Is it loading every target into memory one-by-one to check the metadata?

@wlandau-lilly
Collaborator

wlandau-lilly commented Nov 1, 2017

We could put some exception handling around is_imported() in clean(). Better yet, for speed, we could even add a redundant imported namespace and check that instead. To avoid the problem of half-saved objects, we should cache to that namespace as early as possible in a make().
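
Both ideas might look roughly like this; the "imported" namespace and the fall-back-to-FALSE behavior are assumptions, not drake's actual internals:

```r
# Check the small redundant flag first; if even that read fails
# (corrupted or missing key), fall back to FALSE instead of erroring,
# so that clean() can proceed past a half-saved target.
is_imported_safe <- function(target, cache) {
  tryCatch(
    cache$get(target, namespace = "imported"),
    error = function(e) FALSE
  )
}
```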

In plot_graph(), there are two bottlenecks:

  1. Building the igraph object that shows dependency relationships. Skip this with a pre-built config.
  2. Importing everything to check which targets are outdated. Skip this with from_scratch = TRUE in development drake.

@wlandau-lilly
Collaborator

But anyway, the immediate way forward on this is definitely a separate imported namespace. I think it is enough to

  1. Cache TRUE or FALSE to the imported namespace just above this line in build_in_hook()
  2. Move that call to store_target() to the very end of build_in_hook(), just before returning the value. This ensures all the small bits of the cache are saved before any really big values.
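
In other words, the build step could be reordered along these lines; build_in_hook() and the helpers below are simplified stand-ins for drake's internals, not the real code:

```r
build_in_hook <- function(target, config) {
  # Step 1: record the cheap metadata flag up front, so a killed
  # process leaves at most a tiny, easily repaired key behind.
  config$cache$set(target, is_import(target, config),
                   namespace = "imported")

  value <- build_target(target, config)  # hypothetical build step

  # Step 2: store the potentially large value last of all.
  store_target(target, value, config)
  value
}
```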

@wlandau-lilly
Collaborator

@dapperjapper I think I fixed it in 661e59f. Would you try again?

@wlandau-lilly
Collaborator

wlandau-lilly commented Nov 2, 2017

By the way, I think clean() might run faster now.

@violetcereza
Contributor Author

violetcereza commented Nov 2, 2017 via email

@wlandau-lilly
Collaborator

Actually, building the igraph does not touch the cache at all. See calls to lightly_parallelize() for my attempt to make it faster, along with the thread for #121, particularly here and the follow-up.

@wlandau-lilly
Collaborator

Hmmm. I thought I had fixed clean(), but I actually think I won't truly fix this one until I solve #129 or richfitz/storr#55 is fixed. The trouble is that old projects do not have an imported namespace. We will have to migrate the cache of an old project to a new format before doing anything else. However, in new projects, clean() should work just fine.

FYI: 661e59f created a back-compatibility bug in clean() that I fixed with 99b7db2.

@kendonB
Contributor

kendonB commented Nov 3, 2017

I also had the OP's error and it magically went away - likely after I inadvertently upgraded for a different new feature.

@wlandau-lilly
Collaborator

wlandau-lilly commented Nov 4, 2017

Very helpful to know. I think I will implement something like @fmichonneau's suggestion for storr over all the namespaces and then close.

wlandau-lilly added a commit that referenced this issue Nov 4, 2017
@wlandau-lilly
Collaborator

Fixed along with the solution to #129. See drake::rescue_cache().
