
caching with dependencies on files #238

Closed
barryrowlingson opened this Issue May 16, 2012 · 19 comments


barryrowlingson commented May 16, 2012

If I have a chunk that reads a data file, then modifying the data file doesn't invalidate the cached chunk. Obviously it would be hard for build_dep to spot these dependencies, but maybe they could be specified in the chunk header? Something like:

<<data, fdep="foo.csv">>=
data = read.csv("foo.csv")
@ 

then this chunk would be fetched from the cache unless the file foo.csv has changed. Not sure how you'd specify multiple dependent data files on the chunk header...

Anyway, caching has already saved me enough time to submit this as a feature request :)

yihui (Owner) commented May 17, 2012

See if the last section is clear enough to explain the cache.extra parameter in opts_chunk: http://yihui.name/knitr/demo/cache/

In your case, you can associate the cache with the MD5 hash of your file, e.g.

opts_chunk$set(cache.extra = tools::md5sum('foo.csv'))

Then each time foo.csv is modified, the md5sum will be different and a new cache will be built.

spacedman commented May 17, 2012

Looks good. I was thinking about how to get the modification time of the file instead, since computing the MD5 sum of a large data set might end up taking more time than the chunk execution. I wonder if the mtime from file.info will do...

yihui (Owner) commented May 17, 2012

Yes, file.info()$mtime will be more lightweight; just throw it into the cache.extra option.
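
A minimal sketch of this suggestion (the file name foo.csv is just an illustration): in a setup chunk, tie the cache to the file's modification time, so cached chunks are invalidated whenever foo.csv is touched.

opts_chunk$set(cache.extra = file.info('foo.csv')$mtime)

Note that mtime also changes when the file is re-saved without any change to its contents, whereas the MD5 sum only changes when the contents actually change.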

spacedman commented May 17, 2012

Works perfectly. Next idea is to use httr and HEAD to cache chunks that depend on a URL, so that a chunk only runs (and re-downloads the dataset) if something has changed. Might be tricky, though, since what web servers send back in the HEAD response might not be good enough. Anyway, file dependencies are working fine now! Brilliant!
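
A rough sketch of the HEAD idea, assuming the httr package and a hypothetical data_url; servers that omit the Last-Modified/ETag headers will never trigger a rebuild with this approach, so it is only a starting point:

```r
library(httr)
library(knitr)

data_url <- "http://example.com/data.csv"  # hypothetical URL

# Tie the cache to the validators the server reports for the remote file;
# if either Last-Modified or ETag changes, the chunk cache is invalidated.
h <- headers(HEAD(data_url))
opts_chunk$set(cache.extra = list(h[["last-modified"]], h[["etag"]]))
```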

yihui (Owner) commented May 18, 2012

You are welcome to experiment with the HEAD idea, and I will be happy to see examples :)

yihui closed this in c8e6b17 on May 23, 2012

lwaldron commented Aug 8, 2013

If I understand correctly,

 opts_chunk$set(cache.extra=tools::md5sum("foo.csv")) 

will cause the entire cache to be rebuilt if foo.csv changes. Is there a way to specify a cache.extra for one chunk only?

lwaldron commented Aug 8, 2013

I think I figured it out myself:

<<data, cache.extra=tools::md5sum("foo.csv")>>=
data = read.csv("foo.csv")
@ 

yihui (Owner) commented Aug 8, 2013

@lwaldron You are absolutely correct.

lwaldron commented Aug 8, 2013

This is a really wonderful feature. In conjunction with dependson=, it is almost like having Make built right into R.

yihui (Owner) commented Aug 9, 2013

Haha, to some extent, yes, it is Make in R 😄 Additional functions like dep_prev()/dep_auto() can make it better than Make, because you may not even need to specify the dependencies, although they do not apply to your case here (file dependencies have to be specified manually).
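
A small sketch of what the automatic dependencies look like in a setup chunk (this covers dependencies between chunks only, not the file dependencies discussed above):

```r
library(knitr)

# Infer dependencies among cached chunks from the global variables they
# create and use, instead of writing dependson= by hand.
opts_chunk$set(cache = TRUE, autodep = TRUE)

# Or, more conservatively, make every cached chunk depend on all of the
# chunks before it:
# dep_prev()
```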

christophergandrud referenced this issue Aug 9, 2013: Updates for Second Edition #27 (closed)

jimhester (Contributor) commented Sep 12, 2013

The only issue I can see with using md5sums is that they take a while to compute if the file is large. I use the file modification times for file dependencies instead; that way, if a file has been modified, the code is rerun. If you define the following function in your setup chunk

mtime <- function(files) {
  lapply(Sys.glob(files), function(x) file.info(x)$mtime)
}

then you can pass a vector of filenames to cache.extra in your regular chunks like so:

```{r cache=TRUE, cache.extra=mtime(c('file1.txt', 'file2.txt'))}
```

yihui (Owner) commented Sep 12, 2013

@jimhester thanks for the trick!

zachary-foster (Contributor) commented Apr 15, 2015

This is great for checking for input file changes! However, sometimes I want to rerun a cached chunk when the output of that chunk has changed. For example, if an output file needed for following chunks is deleted, it would be nice if that chunk would detect the lack of the output file and recreate it next time the Rmd is built.

I might be missing something obvious, but I have tried for a while and have not found a way to use cache.extra to accomplish this. Using cache.extra=mtime(output_files) does not work, since the chunk changes the modification times of the output files. Using cache.extra=file.exists(output_files) almost works, but it has to be run twice: on the first run after a file is deleted, cache.extra is FALSE and the chunk is run, creating the output files, so next time cache.extra is TRUE and the chunk is run again. After that the chunk is not run anymore.

The best working solution I have so far is to delete the cache when output files are not found.

delete_chunk_cache <- function(chunk_name) {
  cache_path <- opts_current$get("cache.path")
  cached_files <- list.files(cache_path)
  search_pattern <- paste0("^", chunk_name, "_[a-z0-9]+\\.[a-z0-9]+$")
  chunk_cache_files <- cached_files[grepl(search_pattern, cached_files)]
  file.remove(file.path(cache_path, chunk_cache_files))
}

delete_next_chunk_cache <- function() {
  next_chunk <- all_labels()[which(all_labels() == opts_current$get("label")) + 1]
  delete_chunk_cache(next_chunk)
}

mtime <- function(files) {
  lapply(Sys.glob(files), function(x) file.info(x)$mtime)
}

```{r chunk_1}
input_files = c("list", "of", "paths")
output_files = c("list", "of", "paths")
if (any(!file.exists(output_files))) delete_next_chunk_cache()
```

```{r chunk_2, cache=TRUE, cache.extra=mtime(input_files)}
system(some_command_that_makes_output_files)
```

Do you know of a better way to do this? Thanks!

yihui (Owner) commented Apr 16, 2015

@zachary-foster You can probably disable cache if the file does not exist, e.g. cache = file.exists('some_file').
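
As a concrete sketch of that suggestion (the chunk label and file name are placeholders): caching is simply switched off whenever the output file is missing, which forces the chunk to run and recreate it.

```{r make_output, cache=file.exists("output.csv")}
system(some_command_that_makes_output_files)
```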

zachary-foster (Contributor) commented Apr 16, 2015

Thanks for the quick reply! I tried that and it works OK, but it has one small issue: if the output file is deleted and a change is also made to the code, it takes two runs of the chunk to cache correctly. On the first build after the deletion of the output file and the change in code, cache=FALSE, so the chunk runs (as it should), but the change in code is not recorded; when the document is built a second time, cache=TRUE (since the output files were created on the first run), so the change in code is noticed and the chunk runs again.

However, a simultaneous change in output files and code is probably rare enough that this solution is sufficient.

Would a chunk option like rebuild.cache = !file.exists('some_file') be useful? rebuild.cache would not disable caching, but when TRUE it would force the chunk to be evaluated regardless of code changes. I think that would behave correctly even if the code and output change at the same time. I do a bit of R package development, so I can try to add the rebuild.cache chunk option if you think it would be useful.

Thank you!

yihui (Owner) commented Apr 16, 2015

@zachary-foster Normally cache invalidation is only about the input: if the input changes, we invalidate the cache. In your case, it sounds like there is a circular dependency: the code chunk depends on its own output. I'm not sure why that has to be the case, i.e. why this code chunk needs its own output file to run.

zachary-foster (Contributor) commented Apr 16, 2015

It does not need its own output file to run, but it is part of a "pipeline" that calls many external commands. During a given build, it is checking for the output of the previous build. If the chunk's output file is deleted but the code is not changed, the chunk is not run, and then the following chunks break because the commands they call need the first chunk's output.

yihui (Owner) commented Apr 16, 2015

@zachary-foster I see. That makes sense now. A pull request is welcome :) I prefer the option name to be cache.rebuild instead of rebuild.cache. When you calculate the MD5 digest, remember to exclude this option: https://github.com/yihui/knitr/blob/master/R/block.R#L81
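
For reference, a sketch of how the proposed option (under the preferred name cache.rebuild) would be used in a chunk header; the label and file name are placeholders. When the output file is missing, the cached results are discarded and the chunk is evaluated again even though its code has not changed:

```{r make_output, cache=TRUE, cache.rebuild=!file.exists("output.csv")}
system(some_command_that_makes_output_files)
```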

zachary-foster (Contributor) commented Apr 17, 2015

Thanks for the tip! That probably saved me some debugging time. Pull request sent. Thanks for considering it!
