make using roxygen-like documentation for analysis directories #77

Open
AliciaSchep opened this issue May 11, 2017 · 11 comments

Comments

@AliciaSchep

commented May 11, 2017

I've been thinking about whether it would be possible & useful to have roxygen-like tags for documenting the inputs and outputs of analysis scripts, which could then be used to easily create a makefile when needed. This idea is closely related to the first part of thread #5, particularly the second comment (from @njtierney) about the struggle to go from exploratory analysis to something reproducible and the subsequent discussion of make, but as that thread has moved on a bit into testing/CI/pkg issues I figured I'd start a new thread.

The idea would be that in a given R script (or Rmarkdown) you might at some point read in inputs and at other points write outputs. You could tag inputs and outputs:

#' myfile.csv
#' A really cool data file!
#' @source coolwebsite.com
#' @input myfile.csv
mytable <- read_csv("myfile.csv")

myoutput <- do_stuff(mytable)

#' myoutput.rds
#' My awesome calculated result
#' @output myoutput.rds
saveRDS(myoutput, "myoutput.rds")

Then another script might have:

#' @input myoutput.rds
myinput <- readRDS("myoutput.rds")

Within the directory containing all these scripts, you could run a command that reads through all the scripts and their input and output tags and creates a makefile. Any circular dependencies would get flagged. The command would also create man pages for each input and output object, as well as overall workflow documentation with a dependency graph linking to the individual input/output documentation.
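To make the idea concrete, here is a minimal sketch in base R of the tag-reading step. Everything in it is an assumption for illustration: the helper names `parse_io_tags()` and `make_rule()`, the script name `clean.R`, and the choice to emit one make rule per script are hypothetical, not part of roxygen or any existing package.

```r
# Hypothetical helper: collect @input/@output tags from roxygen-style comments.
parse_io_tags <- function(lines) {
  pattern <- "^[[:space:]]*#'[[:space:]]*@(input|output)[[:space:]]+([^[:space:]]+)[[:space:]]*$"
  hits <- regmatches(lines, regexec(pattern, lines))
  hits <- hits[lengths(hits) == 3]  # keep only lines that matched the tag pattern
  tag  <- vapply(hits, `[`, character(1), 2)
  file <- vapply(hits, `[`, character(1), 3)
  list(inputs = file[tag == "input"], outputs = file[tag == "output"])
}

# Hypothetical helper: turn one script's tags into a make rule (target: prerequisites).
make_rule <- function(script, io) {
  if (length(io$outputs) == 0) return(character(0))
  c(
    paste0(paste(io$outputs, collapse = " "), ": ",
           paste(c(script, io$inputs), collapse = " ")),
    paste0("\tRscript ", script)  # make recipes must start with a tab
  )
}

# The first script from the example above, held as a character vector.
script1 <- c(
  "#' @input myfile.csv",
  'mytable <- read_csv("myfile.csv")',
  "#' @output myoutput.rds",
  'saveRDS(myoutput, "myoutput.rds")'
)
io1 <- parse_io_tags(script1)
rule1 <- make_rule("clean.R", io1)
cat(rule1, sep = "\n")
```

In a real directory you would call `readLines()` on each script and write the combined rules to a `Makefile`. One caveat with this sketch: a rule listing several targets is treated by make as one independent rule per target, so scripts with multiple outputs would need more care.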

There is already an R package that automatically makes makefiles from R scripts -- easyMake. It tries to automatically detect when a script reads in an input or exports a file. I think roxygen-like tags might be a bit more flexible and transparent, as you would be able to specify each input and output file without having to rely on every input and output function being recognized. This roxygen-like system would also enable better documentation of the workflow and its inputs/outputs than just a makefile or a dependency graph of filenames.

Perhaps rather than creating a new roxygen-like system, roxygen itself could also be adapted for this purpose?

@stephlocke

commented May 11, 2017

I like the idea of enhanced metadata & documentation for my work

@bzkrouse

commented May 11, 2017

Nice idea, I'm also interested in giving more attention to the struggle of organizing and keeping track of exploratory analysis. The concept of collecting metadata on analysis was also discussed in #23 - although also with emphasis on collecting information about results.

@MilesMcBain

Contributor

commented May 17, 2017

I only just noticed this issue in the midst of cleaning up mine. I think what you're describing here is a REALLY great idea. How about a name: makedown? 😉

@hadley

Member

commented May 18, 2017

I like this idea but I think generating a makefile will be error prone. Will be more robust (if more work) to manage the dependency graph in R itself.
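As a rough illustration of what managing the dependency graph in R might look like, the sketch below orders scripts so that producers run before consumers and stops on circular dependencies. The function name `script_order()`, the script names, and the list layout (one element per script with `inputs` and `outputs`) are assumptions made for this example, not an existing API.

```r
# Hypothetical sketch: order scripts by their declared inputs/outputs.
# `io` is a named list, one element per script, each with $inputs and $outputs.
script_order <- function(io) {
  scripts <- names(io)
  # For each script, find the scripts that produce one of its inputs.
  deps <- lapply(io, function(x) {
    producers <- vapply(io, function(y) any(x$inputs %in% y$outputs), logical(1))
    scripts[producers]
  })
  done <- character(0)
  while (length(done) < length(scripts)) {
    # A script is ready once everything it depends on has been scheduled.
    is_ready <- vapply(deps, function(d) all(d %in% done), logical(1))
    ready <- setdiff(scripts[is_ready], done)
    if (length(ready) == 0) {
      stop("circular dependency among: ",
           paste(setdiff(scripts, done), collapse = ", "))
    }
    done <- c(done, ready)
  }
  done
}

io <- list(
  "report.R" = list(inputs = "myoutput.rds", outputs = "report.html"),
  "clean.R"  = list(inputs = "myfile.csv",   outputs = "myoutput.rds")
)
run_order <- script_order(io)
print(run_order)
```

Because the graph lives in an ordinary R list, the same structure could feed a plotting or documentation step, which is harder to do once it has been flattened into a makefile.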

@AliciaSchep

Author

commented May 19, 2017

Thanks @bzkrouse for linking this to thread #23. I hadn't read through that one yet, and some of the goals are certainly shared, although I think this idea is more limited in scope. Compared to some of the fairly comprehensive systems discussed in that thread, the idea here is for something fairly minimal and very easy to incorporate into existing script-based analyses.

@MilesMcBain makedown sounds like a great name! Even if ultimately make itself isn't actually used...

As for using make versus managing things in R itself, I think the main benefit of using make is less work 😁 Although perhaps generating the makefile in a reliable way may prove harder than I am anticipating...

@AliciaSchep changed the title from "roxygen-like documentation for analysis directories" to "make using roxygen-like documentation for analysis directories" May 19, 2017

@hadley

Member

commented May 19, 2017

Generating the makefile will allow you to get a quick proof of concept up and running, and that's a great goal for the unconf. However, code generation in general is hard, and having the dependency graph in another environment means you can't do cool visualisations in R etc.

@hadley

Member

commented May 19, 2017

Another thing worth considering is whether you could automatically detect inputs/outputs for many common situations - i.e. in your example above, you could parse the file, detect read.csv() and saveRDS(), and automatically generate the input/output annotations. You'd still need manual annotations for non-standard functions, but you might be able to give people a fairly comprehensive solution for free.
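A minimal sketch of that kind of detection, using only base R's `parse()` to walk the code without running it. The function name `detect_io()` and the default lists of reader and writer functions are assumptions for illustration, and treating the first string literal in a call as the file path is a heuristic, not a rule.

```r
# Hypothetical sketch: find file paths passed to known reader/writer functions.
detect_io <- function(code,
                      readers = c("read.csv", "read_csv", "readRDS"),
                      writers = c("saveRDS", "write.csv", "ggsave")) {
  inputs <- character(0)
  outputs <- character(0)

  walk <- function(expr) {
    if (!is.call(expr)) return(invisible(NULL))
    fn <- expr[[1]]
    if (is.name(fn) && as.character(fn) %in% c(readers, writers)) {
      # Heuristic: assume the first string literal argument is the file path.
      literals <- Filter(is.character, as.list(expr)[-1])
      if (length(literals) > 0) {
        path <- literals[[1]]
        if (as.character(fn) %in% readers) {
          inputs <<- c(inputs, path)
        } else {
          outputs <<- c(outputs, path)
        }
      }
    }
    # Recurse into sub-expressions, skipping empty arguments such as x[1, ].
    for (e in as.list(expr)) if (!missing(e)) walk(e)
    invisible(NULL)
  }

  for (e in parse(text = code)) walk(e)
  list(inputs = unique(inputs), outputs = unique(outputs))
}

code <- '
df <- read_csv("my_csv.csv", col_types = types)
ggplot(df, aes(x, y)) + geom_point()
ggsave("my_plot.pdf")
'
io <- detect_io(code)
str(io)
```

This sketch misses namespaced calls such as `readr::read_csv()` and paths built at run time (for example with `file.path()`), which is where manual annotations would still be needed.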

@hadley

Member

commented May 19, 2017

It would also be handy to be able to have this work inline, although you'd need some way to represent that the output was new R objects:

if_needed(
  input = c(object("types"), "my_csv.csv"),
  output = c(object("df"), "my_plot.pdf"),
  {
    df <- read_csv("my_csv.csv", col_types = types)
    ggplot(df, aes(x, y)) + geom_point()
    ggsave("my_plot.pdf")
  }
)

And in that case you could determine the inputs and outputs from the code, so you could just write:

if_needed({
  df <- read_csv("my_csv.csv", col_types = types)
  ggplot(df, aes(x, y)) + geom_point()
  ggsave("my_plot.pdf")
})
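For a sense of how such a function could behave, here is a deliberately simplified sketch that handles file inputs and outputs only, comparing modification times the way make does. It is an assumption-laden illustration, not the proposed design: it ignores the `object()` idea entirely, and the explicit `input`/`output` arguments are required rather than inferred from the code.

```r
# Hypothetical sketch: run `code` only if an output is missing or older than an input.
if_needed <- function(input, output, code) {
  stopifnot(all(file.exists(input)))
  stale <- !all(file.exists(output)) ||
    any(file.mtime(input) > min(file.mtime(output)))
  if (stale) {
    force(code)  # lazy evaluation: the block is only evaluated here
  }
  invisible(stale)
}

# Usage with temporary files (file names are illustrative).
csv <- tempfile(fileext = ".csv")
rds <- tempfile(fileext = ".rds")
write.csv(data.frame(x = 1:3), csv, row.names = FALSE)

first  <- if_needed(csv, rds, saveRDS(read.csv(csv), rds))  # output missing: runs
second <- if_needed(csv, rds, saveRDS(read.csv(csv), rds))  # up to date: skipped

# Pretend the input changed later, so the output is stale again.
Sys.setFileTime(csv, Sys.time() + 60)
third <- if_needed(csv, rds, saveRDS(read.csv(csv), rds))   # runs again
```

Because the block is an unevaluated promise, any objects it assigns are created in the calling environment, so skipping the block also skips creating them. That is the gap the `object()` markers in the example above would have to fill.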

@hadley

Member

commented May 22, 2017

I hope you don't mind but I've taken your basic idea and run with it: https://docs.google.com/document/d/1avYAqjTS7zSZn7JAAOZhFPkhkPvYwaPVrSpo31Cu0Yc/edit#. I'd love your thoughts!

@AliciaSchep

Author

commented May 23, 2017

Definitely don't mind, looks great! In terms of my original idea, there were two related goals: one was linking dependencies across R files (without having to create your own makefile), and the other was to enable documentation of inputs and outputs so as to be able to create a documented dependency graph. The proposal for lazyr seems like a great solution for the first goal, but doesn't necessarily help with the second, although perhaps those goals shouldn't have been linked anyways.

@coatless

commented Jun 12, 2017

This feels like the merging of CodeDepends and YesWorkflow / Live Demo, which would be very useful.

7 participants