Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

potential to add check that data files are committed #242

Open
stephens999 opened this issue Mar 7, 2021 · 4 comments
Open

potential to add check that data files are committed #242

stephens999 opened this issue Mar 7, 2021 · 4 comments

Comments

@stephens999
Copy link
Member

How feasible would it be to add a check that any files being sourced or loaded are included in the repo?
eg. it seems to be an easy mistake to make to have a line like:

dat = readRDS('data/my_dat.rds')

in my Rmd file but forget to publish data/my_dat.rds
The ability to check for this kind of thing could be helpful, although maybe not so straightforward?

@jdblischak
Copy link
Member

How feasible would it be to add a check that any files being sourced or loaded are included in the repo?

I think this is a great feature, but to be robust, it would need a more substantial solution like the ideas discussed in #9. Without the user specifying which files are input (and should be committed), then it's very hard for workflowr to guess what should be done.

Without this extra infrastructure, we could do something like the following:

  1. When a file is built, do a regex search for file paths starting with data/.
  2. Check if these files have been committed
  3. If they haven't been committed, try to warn the user in some way. Maybe a warning message after the code chunk saying that the previous file wasn't committed? I don't want to add a failing check to the workflowr report, since a user may not want to commit the file (maybe it's a huge file, and it's available to download on a public archive)

A big caveat of the above is that this check only works for files in data/, which is only a suggested directory name.

Thoughts?

@pcarbo
Copy link
Member

pcarbo commented Mar 8, 2021

I agree it would be a nice feature to have despite inevitable caveats.

@stephens999
Copy link
Member Author

stephens999 commented Mar 10, 2021 via email

@jdblischak
Copy link
Member

What if a Rmd file saves a file to data/ rather than loads it?

The more I think about the caveats, the more I don't like this feature. It's a good point that users can write to data/ as well. In that case, it wouldn't make sense to throw a warning that the file hasn't been committed yet. In fact, it might not even exist at the time the workflowr code is running to check its Git status.

The problem is that we don't have any clever way to really know what the code is doing as far as input/output. Some examples:

# Is the code reading or writing?
x <- customPkg::customFunc("data/file.txt")

# Regexes require the file path to be a contiguous string
x <- read.table(file.path("data", "file.txt"))

# If the user sets knit_root_dir to analysis/ and uses the here package to resolve file paths.
# Workflowr would look for the file in analysis/data/file.txt b/c it's not executing the code
x <- read.table(here::here("data/file.txt"))

This problem reminds me of how the drake package handles dependencies in Rmd files. You have to use its custom function readd() to import the data file. This is the only way that drake can know which files are input files that need to be checked to see if they have been modified.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants