
Corrupted storr when killing making process #126

Closed
violetcereza opened this issue Nov 1, 2017 · 17 comments

@violetcereza
Contributor

I'm aware that this may be beyond the scope of drake, but occasionally I get this weird non-reproducible bug. In the process of working with my data, I sometimes mistakenly start making a target that I know will fail or be incorrect. This freezes up my R session while I wait for it to resolve. Instead of waiting, I kill the RStudio process, or the Makefile process from htop.

This usually works fine -- in fact, using Makefile parallelism is great because you can see which targets are associated with which processes directly from htop and when you kill a process you don't have to relaunch your RStudio session and lose everything in your global environment.

However, once in a blue moon, killing a process seems to have consequences. You get these targets that seem to be stuck in a half-saved state, and that cause drake functions to fail with the error:

Error in readRDS(self$name_hash(hash)) : error reading from connection

There seems to be no way to recover from this, other than deleting the whole .drake directory and starting from scratch. Even if you know which target was corrupted, clean(target) fails with the same error. I'm having some success with get_cache()$del(target), which makes some things work again but still seems to leave the cache in a semi-corrupted state.
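
A minimal sketch of that workaround, assuming drake's get_cache() returns the project's storr cache ("target" here is a placeholder name):

```r
library(drake)

# Load the storr cache of the current project (the .drake/ directory).
cache <- get_cache()

# clean("target") reads the stored value first and so fails on a
# corrupted key. Deleting the raw key directly skips that read,
# but also skips drake's bookkeeping in other namespaces, which is
# why the cache can be left in a semi-corrupted state.
cache$del("target")
```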

I guess I'm just posting to see if anyone else has had similar experiences / how they fixed them without clearing the whole cache.

@wlandau-lilly
Collaborator

wlandau-lilly commented Nov 1, 2017

@dapperjapper I am really glad you brought this up. I think the top priority is to improve storr, maybe adding something like this:

library(storr)
x <- storr_rds("my_cache")
x$set(key = "a", value = "b")
# break 'a'
x$repair() # remove dangling files for half-saved targets.

Would you also post this on the storr issue tracker? The solution will ultimately come from there. Still, I will keep this drake thread open for a while.

After a failed clean(target), do you get a traceback() that dives into the call stack of storr functions?

I am also wondering how to improve job monitoring. For example, what would it take to label and monitor jobs by target name for all kinds of parallelism, not just Makefile parallelism? For "future_lapply" parallelism, there is probably some way that involves the configuration *.tmpl file, but I would also like something similar for other forms of parallelism. Drake has in_progress() and progress(), but that does not provide a way to kill jobs. processx has easier job monitoring, but the scope looks limited to processes launched with processx itself.

@violetcereza
Contributor Author

I will post on the storr issue tracker! Thanks for the quick response!

I don't know a lot about processes/*nix systems in general, so I don't think I can be terribly helpful there. One thing I have been wondering about is make()ing in the background without blocking the R session. It seems like knitting Rmds is able to run without blocking, and I'm wondering if drake could ever get a spot like that. Does the knitting process have its own separate R session that is maintained by RStudio? Can drake get one?

Here is a traceback:

> clean(premiers_data_raw_sept)

Error in readRDS(self$name_hash(hash)) : error reading from connection

> traceback()

13: readRDS(self$name_hash(hash))
12: self$driver$get_object(hash)
11: self$get_value(self$get_hash(key, namespace), use_cache)
10: cache$get(target)
9: (function (target, cache) 
   {
       if (!(target %in% cache$list())) 
           return(FALSE)
       cache$get(target)$imported
   })(target = dots[[1L]][[1L]], cache = <environment>)
8: mapply(FUN = function (target, cache) 
   {
       if (!(target %in% cache$list())) 
           return(FALSE)
       cache$get(target)$imported
   }, target = "premiers_data_raw_sept", MoreArgs = list(cache = <environment>), 
       SIMPLIFY = TRUE, USE.NAMES = TRUE)
7: do.call("mapply", c(FUN = FUN, args[dovec], MoreArgs = list(args[!dovec]), 
       SIMPLIFY = SIMPLIFY, USE.NAMES = USE.NAMES))
6: is_imported(target = target, cache = cache)
5: (function (target, cache) 
   {
       if (is.null(cache)) {
           return()
       }
       if (is_file(target) & !is_imported(target = target, cache = cache)) {
           unquote(target) %>% unlink(recursive = TRUE, force = TRUE)
       }
       default <- cache$default_namespace
       for (space in c(default, "depends", "filemtime", "functions")) if (target %in% 
           cache$list(namespace = space)) {
           cache$del(target, namespace = space)
       }
       invisible()
   })(target = dots[[1L]][[1L]], cache = <environment>)
4: mapply(FUN = function (target, cache) 
   {
       if (is.null(cache)) {
           return()
       }
       if (is_file(target) & !is_imported(target = target, cache = cache)) {
           unquote(target) %>% unlink(recursive = TRUE, force = TRUE)
       }
       default <- cache$default_namespace
       for (space in c(default, "depends", "filemtime", "functions")) if (target %in% 
           cache$list(namespace = space)) {
           cache$del(target, namespace = space)
       }
       invisible()
   }, target = "premiers_data_raw_sept", MoreArgs = list(cache = <environment>), 
       SIMPLIFY = TRUE, USE.NAMES = TRUE)
3: do.call("mapply", c(FUN = FUN, args[dovec], MoreArgs = list(args[!dovec]), 
       SIMPLIFY = SIMPLIFY, USE.NAMES = USE.NAMES))
2: uncache(targets, cache = cache)
1: clean(premiers_data_raw_sept)

@violetcereza
Contributor Author

Semi-related Q: why are you storing object metadata like $type and $imported inside the objects namespace alongside the object itself, instead of peeling them off to a separate storr namespace?

@wlandau-lilly
Collaborator

For $type and $imported, that was a poor design choice. Basically, $type and $imported detect imported functions. If a function is imported from the user's environment, I want it to depend on any nested imports in the body. For the basic example, storr_rds(".drake")$get("reg1") also has a $depends element that depends on nested objects. Essentially, I put in the default namespace everything that I want to be reproducibly tracked. In hindsight, $type and $imported should not be there. If the value doesn't change, it should not matter if you change a target into an import in your workflow plan.
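
For illustration, a sketch of the two layouts using plain storr calls; the "imported" namespace name here is an assumption, not drake's actual schema:

```r
library(storr)
st <- storr_rds(".drake")

# Current layout: the metadata rides along with the value, so reading
# the flag means deserializing the whole stored object:
#   st$get("reg1")$imported

# Separate-namespace layout: the flag is a tiny key of its own, so it
# can be read without touching the (possibly large) value.
st$set("reg1", FALSE, namespace = "imported")
st$get("reg1", namespace = "imported")
```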

Fixing that will be ambitious, especially because of back compatibility concerns. A separate preprocessing step will be needed in order to migrate data to the appropriate storr namespaces before a make(). But I think it can be done.

Here, I see that future can casually run stuff in the background. I will probably try this on my own projects, and maybe include it in the documentation, but probably not work it into drake's code base. It certainly is not compatible with drake's future_lapply() backend. processx and svSocket may be worth considering too.
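
A sketch of that background idea with future, outside drake's own API; the plan choice and my_plan are assumptions:

```r
library(future)
plan(multisession)  # evaluate in a separate background R session

# Launch make() without blocking the interactive session.
f <- future(drake::make(my_plan))

# Keep working in the console; poll and collect the result later.
resolved(f)
value(f)
```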

@wlandau-lilly
Collaborator

wlandau-lilly commented Nov 1, 2017

If we are going to refactor the most sensitive internals, we are most likely not going to have back compatibility. So I want to think long and hard about it, plan and discuss with you guys, and get your support. I probably will not work on it until well after the next CRAN release at least. When I do work on it, I want to share formal design specs and talk about them with the community.

I learned a TON over the course of drake's development, and there is a lot I would do differently, especially more elegantly, if I went back and did it again from scratch. But for sensitive changes to hashing and caching, we need patience and careful planning.

@wlandau-lilly
Collaborator

To be clear, I have not made the decision to refactor anything that deep. I think there needs to be a stronger reason than just the elegance of the code. Back compatibility is more important than the slight oversensitivity due to tracking $type and $imported.

@violetcereza
Contributor Author

Definitely -- I agree that it is a big decision. The reason I ask is that the traceback seems to include is_imported(), which seems to be loading the corrupted file before clean()ing it. It seems like overkill to have to load the whole built object just to determine if it is imported or not, and additionally it breaks clean() if the given object is corrupted.

Is this also the reason why plot_graph() is rather slow for me, when working on a project with large targets? Is it loading every target into memory one-by-one to check the metadata?

@wlandau-lilly
Collaborator

wlandau-lilly commented Nov 1, 2017

We could put some exception handling around is_imported() in clean(). Better yet, for speed, we could even add a redundant imported namespace and check that instead. To avoid the problem of half-saved objects, we should cache to that namespace as early as possible in a make().
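
Both ideas might look roughly like this; the "imported" namespace and the fall-back-to-FALSE behavior are assumptions, not drake's actual internals:

```r
# Check the small redundant flag first; if even that read fails
# (corrupted or missing key), fall back to FALSE instead of erroring,
# so that clean() can proceed past a half-saved target.
is_imported_safe <- function(target, cache) {
  tryCatch(
    cache$get(target, namespace = "imported"),
    error = function(e) FALSE
  )
}
```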

In plot_graph(), there are two bottlenecks:

  1. Building the igraph object that shows dependency relationships. Skip this with a pre-built config.
  2. Importing everything to check which targets are outdated. Skip this with from_scratch = TRUE in development drake.

@wlandau-lilly
Collaborator

But anyway, the immediate way forward on this is definitely a separate imported namespace. I think it is enough to

  1. Cache TRUE or FALSE to the imported namespace just above this line in build_in_hook()
  2. Move that call to store_target() to the very end of build_in_hook(), just before returning the value. This ensures all the small bits of the cache are saved before any really big values.
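
In other words, the build step could be reordered along these lines; build_in_hook() and the helpers below are simplified stand-ins for drake's internals, not the real code:

```r
build_in_hook <- function(target, config) {
  # Step 1: record the cheap metadata flag up front, so a killed
  # process leaves at most a tiny, easily repaired key behind.
  config$cache$set(target, is_import(target, config),
                   namespace = "imported")

  value <- build_target(target, config)  # hypothetical build step

  # Step 2: store the potentially large value last of all.
  store_target(target, value, config)
  value
}
```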

@wlandau-lilly
Collaborator

@dapperjapper I think I fixed it in 661e59f. Would you try again?

@wlandau-lilly
Collaborator

wlandau-lilly commented Nov 2, 2017

By the way, I think clean() might run faster now.

@violetcereza
Contributor Author

violetcereza commented Nov 2, 2017 via email

@wlandau-lilly
Collaborator

Actually, building the igraph does not touch the cache at all. See calls to lightly_parallelize() for my attempt to make it faster, along with the thread for #121, particularly here and the follow-up.

@wlandau-lilly
Collaborator

Hmmm. I thought I had fixed clean(), but I actually think I won't truly fix this one until I solve #129 or richfitz/storr#55 is fixed. The trouble is that old projects do not have an imported namespace. We will have to migrate the cache of an old project to a new format before doing anything else. However, in new projects, clean() should work just fine.

FYI: 661e59f created a back-compatibility bug in clean() that I fixed with 99b7db2.

@kendonB
Contributor

kendonB commented Nov 3, 2017

I also had the OP's error and it magically went away - likely after I inadvertently upgraded for a different new feature.

@wlandau-lilly
Collaborator

wlandau-lilly commented Nov 4, 2017

Very helpful to know. I think I will implement something like @fmichonneau's suggestion for storr over all the namespaces and then close.

wlandau-lilly added a commit that referenced this issue Nov 4, 2017
@wlandau-lilly
Collaborator

Fixed along with the solution to #129. See drake::rescue_cache().
