potential to add check that data files are committed #242

stephens999 · 2021-03-07T15:13:22Z

How feasible would it be to add a check that any files being sourced or loaded are included in the repo?
E.g., it seems like an easy mistake to have a line like:

dat = readRDS('data/my_dat.rds')

in my Rmd file but forget to publish data/my_dat.rds.
The ability to check for this kind of thing could be helpful, although maybe not so straightforward?

jdblischak · 2021-03-08T12:06:58Z

> How feasible would it be to add a check that any files being sourced or loaded are included in the repo?

I think this is a great feature, but to be robust, it would need a more substantial solution like the ideas discussed in #9. Unless the user specifies which files are inputs (and should be committed), it's very hard for workflowr to guess what should be done.

Without this extra infrastructure, we could do something like the following:

1. When a file is built, do a regex search for file paths starting with data/.
2. Check if these files have been committed.
3. If they haven't been committed, try to warn the user in some way. Maybe a warning message after the code chunk saying that the previous file wasn't committed? I don't want to add a failing check to the workflowr report, since a user may not want to commit the file (maybe it's a huge file that is available to download from a public archive).
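
A minimal sketch of the three steps above, under some loud assumptions: the function names (`find_data_paths`, `list_tracked`, `warn_uncommitted`) are hypothetical and not part of workflowr, the regex only sees quoted string literals starting with data/, and "tracked by Git" (as reported by `git ls-files`) is used as a stand-in for "committed", so a file that is staged but not yet committed would pass.

```r
# Hypothetical helper: find quoted string literals that start with data/
find_data_paths <- function(lines) {
  pattern <- "[\"']data/[^\"']+[\"']"
  hits <- unlist(regmatches(lines, gregexpr(pattern, lines)))
  # Strip the surrounding quotes and drop duplicates
  unique(gsub("^[\"']|[\"']$", "", hits))
}

# Hypothetical helper: files tracked by Git, relative to the working directory.
# Requires git on the PATH and a working directory inside the repository.
list_tracked <- function() {
  system2("git", "ls-files", stdout = TRUE)
}

# Warn (never fail) about referenced paths that are not tracked
warn_uncommitted <- function(rmd_lines, tracked) {
  missing <- setdiff(find_data_paths(rmd_lines), tracked)
  if (length(missing) > 0) {
    warning("Referenced but not tracked by Git: ",
            paste(missing, collapse = ", "), call. = FALSE)
  }
  invisible(missing)
}

# Possible usage from the project root:
# warn_uncommitted(readLines("analysis/index.Rmd"), tracked = list_tracked())
```

Passing `tracked` in as an argument keeps the Git lookup separate from the text search, so the warning logic can be tried out without a repository.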

A big caveat of the above is that this check only works for files in data/, which is only a suggested directory name.

Thoughts?

pcarbo · 2021-03-08T12:43:09Z

I agree it would be a nice feature to have despite inevitable caveats.

stephens999 · 2021-03-10T03:06:29Z

I think it is useful. What if an Rmd file saves a file to data/ rather than loading it? Do we also want to emit a warning if the file is not tracked? (Possibly yes?)

jdblischak · 2021-03-12T14:19:17Z

> What if a Rmd file saves a file to data/ rather than loads it?

The more I think about the caveats, the more I don't like this feature. It's a good point that users can write to data/ as well. In that case, it wouldn't make sense to throw a warning that the file hasn't been committed yet. In fact, it might not even exist at the time the workflowr code is running to check its Git status.

The problem is that we don't have any clever way to really know what the code is doing as far as input/output. Some examples:

# Is the code reading or writing?
x <- customPkg::customFunc("data/file.txt")

# Regexes require the file path to be a contiguous string
x <- read.table(file.path("data", "file.txt"))

# If the user sets knit_root_dir to analysis/ and uses the here package to resolve file paths,
# workflowr would look for the file in analysis/data/file.txt because it's not executing the code
x <- read.table(here::here("data/file.txt"))
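
To make these limits concrete, here is roughly what a quoted-path regex would report for the three lines above. The pattern is only an illustration (an assumption, not something workflowr implements), and `knit_root_dir` is set by hand to mirror the third example.

```r
# Illustrative pattern: a quoted string literal that starts with data/
pattern <- "[\"']data/[^\"']+[\"']"

chunk <- c(
  'x <- customPkg::customFunc("data/file.txt")',
  'x <- read.table(file.path("data", "file.txt"))',
  'x <- read.table(here::here("data/file.txt"))'
)

# Which lines contain a match?
matched <- grepl(pattern, chunk)
print(matched)
# [1]  TRUE FALSE  TRUE

# Line 1 matches, but the regex cannot tell a read from a write.
# Line 2 is missed because the path is assembled from separate strings.
# Line 3 matches, but a static check that resolves the literal against
# knit_root_dir points somewhere other than where here::here() would
# resolve it at run time:
knit_root_dir <- "analysis"  # assumed setting, as in the example above
naive_path <- file.path(knit_root_dir, "data/file.txt")
print(naive_path)
# [1] "analysis/data/file.txt"
```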

This problem reminds me of how the drake package handles dependencies in Rmd files. You have to use its custom function readd() to import the data file. This is the only way that drake can know which files are input files that need to be checked to see if they have been modified.
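
For comparison, an explicit-declaration approach could look something like the sketch below. Everything here is hypothetical: `declare_input()` and `declared_inputs()` are made-up names, not workflowr or drake functions. The idea is only that the user marks input files in the code, so nothing has to be guessed from the text.

```r
# Registry of declared input files, kept in a private environment
.inputs <- new.env()
assign("paths", character(0), envir = .inputs)

# Hypothetical wrapper: record the path, then return it unchanged so it can
# be used inline, e.g. readRDS(declare_input("data/my_dat.rds"))
declare_input <- function(path) {
  current <- get("paths", envir = .inputs)
  assign("paths", union(current, path), envir = .inputs)
  path
}

# All paths declared so far, in order of first declaration
declared_inputs <- function() {
  get("paths", envir = .inputs)
}

# After the Rmd has run, the declared inputs could be compared against the
# files Git tracks, for example:
# setdiff(declared_inputs(), system2("git", "ls-files", stdout = TRUE))
```

The trade-off is the one noted above: this only works if users consistently wrap their input paths, which is extra infrastructure they have to adopt.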

jdblischak mentioned this issue Mar 19, 2021

Add functionality to record processing time of the whole script or individual parts of the script #244

Open
