Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A new interface for triggers #473

Closed
wlandau opened this issue Jul 20, 2018 · 14 comments
Closed

A new interface for triggers #473

wlandau opened this issue Jul 20, 2018 · 14 comments

Comments

@wlandau
Copy link
Member

wlandau commented Jul 20, 2018

The problem

The current trigger interface is limiting. The choices "any", "always", "depends", "command", and "file" miss important use cases. For example, what if we want to check some remote data before deciding whether to build a target? For most projects, we would need to create a special upstream target with the fingerprint of the data and then have the real target depend on that fingerprint.

Proposal

It would be better to let users write arbitrary R code for each trigger.

drake_plan(
  target1 = {
    get_data(file_in("data.csv"))
    trigger(
      command = FALSE,
      logical = today() == "Tuesday"
    )
  }
  target2 = trigger(custom = remote_hash()),
  target3 = {
    dplyr::bind_rows(target1, tail(target2))
    trigger(
      file = FALSE,
      condition = is_small_enough(target1)
    )
  },
  target4 = {
    summary(target3)
    trigger(
      depends = TRUE,
      change = head(target3, 1)
    )
  }
)

Proposed trigger() function:

#' @title Set the trigger of a target.
#' @description Use this function inside a target's command
#'   in your [drake_plan()]. The target will rebuild if and only if:
#'   - Any of `command`, `depends`, or `file` is `TRUE`, or
#'   - `condition` evaluates to `TRUE`, or
#'   - `change` evaluates to a value different from last time.
#' @export
#' @seealso [drake_plan()], [make()]
#' @return a list of trigger specification details that
#'   `drake` processes internally when it comes time to decide
#'   whether to build the target.
#' @param command logical, whether to rebuild the target if the
#'   [drake_plan()] command changes.
#' @param depends logical, whether to rebuild if a
#'   non-file dependency changes.
#' @param file logical, whether to rebuild the target
#'   if a [file_in()]/[file_out()]/[knitr_in()] file changes.
#' @param condition R code (expression or language object)
#'   that returns a logical. The target will rebuild
#'   if the code evaluates to `TRUE`.
#' @param change R code (expression or language object)
#'  that returns any value. The target will rebuild
#'   if that value is different from last time
#'   or not already cached.
trigger <- function(
  command = TRUE,
  depends = TRUE,
  file = TRUE,
  condition = NULL,
  change = NULL
){
  stopifnot(is.logical(command))
  stopifnot(is.logical(depends))
  stopifnot(is.logical(file))
  list(
    command = command,
    depends = depends,
    file = file,
    condition = rlang::enexpr(condition),
    change = rlang::enexpr(change)
  )
}

People so far

@noamross brought up a motivating scenario in #252. @AlexAxthelm also has a good idea of what is needed because of his drake-powered work with databases. And of course, @krlmlr had the vision behind some of the best parts of the interface, including #232.

Please let me know if I am forgetting anyone. If possible, it would be nice to have input from affected power users.

@wlandau
Copy link
Member Author

wlandau commented Jul 21, 2018

Also cc @kendonB.

@krlmlr
Copy link
Collaborator

krlmlr commented Jul 21, 2018

I'm not too sure about the syntax. I remember a discussion about a drake_command() function that returns a one-row tibble.

drake_plan(
  target1 = drake_command(
    get_data(file_in("data.csv")),
    trigger(
      command = FALSE,
      logical = today() == "Tuesday"
    )
  ),
  ...
)

@kendonB
Copy link
Contributor

kendonB commented Jul 21, 2018

I like the idea and think @krlmlr's syntax is the tidiest. My only concern would be that introducing a change like this might make large plans very slow to start making.

@wlandau
Copy link
Member Author

wlandau commented Jul 21, 2018

I like Kirill's approach too, mainly because it will be SO much easier to implement. I won't need to complicate the code analysis or sensitive internal bookkeeping. Also, there is a clear separation between the command and the trigger, which is great for users.

The target() function is the one that returns the 1-row tibble, and it works inside and outside of drake_plan().

library(drake)
pkgconfig::set_config("drake::strings_in_dots" = "literals")
drake_plan(
  x = target(
    command = 1 + 1,
    trigger = "always"
  )
)
#> # A tibble: 1 x 3
#>   target command trigger
#> * <chr>  <chr>   <chr>  
#> 1 x      1 + 1   always
target(
  command = 1 + 1,
  trigger = "always"
)
#> # A tibble: 1 x 2
#>   command trigger
#>   <chr>   <chr>  
#> 1 1 + 1   always

Question: should we detect dependencies from the trigger column of the drake_plan()? On the one hand, like @AlexAxthelm said in our conversation on Thursday, we may need some targets to be up to date before a custom trigger can be evaluated properly. On the other hand, if triggers can change the dependency structure of the workflow, targets could fall out of date from changes to the condition or change argument to trigger(). At this point, I would prefer to

  1. Not add any more dependency detection, and
  2. In the manual, caution users about using targets in trigger() and encourage to also mention them in the command so they register as dependencies.

@wlandau
Copy link
Member Author

wlandau commented Jul 21, 2018

If we go Kirill's route (which I suggest we do), here are some implementation to-do's. These are done on the master branch:

  • As implemented right now, the user needs to explicitly name all arguments of target(). Going forward, we should assume the first unnamed argument is the command unless a command is given later in the argument list.
  • Evaluate wildcards in all custom columns, including the trigger column, in evaluate_plan().

And the rest will only exist on the i473 branch until it is ready to merge.

  • Add and test the trigger() function.
  • Detect dependencies from plan$trigger.
  • In should_build_target(), evaluate the trigger() call and make the appropriate build decision.
    • Ensure the triggers can be resolved whether the trigger column has characters or language objects.
    • In triggers.R, implement functions condition_trigger() and value_trigger() to activate those respective triggers.
  • Deprecate (with a helpful warning) the old character triggers and convert them to the new ones. For example, the "always" trigger becomes trigger(condition = TRUE).
  • Clean up the internal remnants of the old trigger interface:
    • Remove functions in triggers.R that are no longer needed.
    • Remove checks about legal triggers.
  • Write the following unit tests.
    • For each trigger, all positive reactions are correct.
    • For each trigger, all negative reactions are correct.
    • make() brings targets up to date regardless of the trigger.
    • Triggers can be expressions.
    • Triggers in the plan override the trigger argument of make().
    • Old triggers are deprecated and converted to the new ones.

@wlandau
Copy link
Member Author

wlandau commented Jul 21, 2018

And of course, we'll want to elaborate on triggers, custom columns, and target() in the drake_plan() chapter of the manual.

@wlandau
Copy link
Member Author

wlandau commented Jul 21, 2018

@kendonB, Kirill's approach will definitely be at least as fast to execute as what I had in mind, probably faster because there is less static code analysis. If we check the condition trigger first and the value trigger last, I do not think there will be a loss in performance during make().

@krlmlr
Copy link
Collaborator

krlmlr commented Jul 21, 2018

You could change the interface of target(...) to target(.command, ...) ; then the usual R argument matching kicks in.

Just thinking out loud. I'm slightly worried about the added complexity, especially in the light of brittle dependency detection for trigger conditions. I'd be very much interested in a "bare bones" version of drake that is restricted to evaluating commands and scheduling jobs, given the dependency graph and perhaps a run time estimate. Maybe this can be made available as a lower-level interface inside drake, or as a separate package? Complex triggers could then be implemented by automatically adding intermediate helper targets as necessary. (Error and progress messages would use artificial target names, though. Pretty sure there are other complications I'm not aware of.)

@wlandau
Copy link
Member Author

wlandau commented Jul 21, 2018

A bare bones job scheduler is what I was going for with workers, but I have still not figured out a way to cleverly disentangle the job scheduling internals from everything else. I am continuing thinking about it, but I consider it very long term.

As for triggers, my original proposal would have added clutter. But with your idea, I feel there is a lot of gain and not much added complexity.

@wlandau
Copy link
Member Author

wlandau commented Jul 22, 2018

Changing my mind about detecting dependencies from triggers. I think we should analyze the code of plan$trigger just like plan$command. Reasons:

  • The behavior more self-consistent.
  • It avoids surprises. Users expect drake to magically detect dependencies, and it should do so.
  • Yes, this could potentially make workflows more brittle, but only in a select few edge cases. There are so few times when users will really need the condition or change argument of trigger().

@AlexAxthelm
Copy link
Collaborator

I'll chime in that having plan$trigger be analyzed the same way as the command would also be preferable in my mind, since I will probably be writing custom trigger functions (sampling from remote databases, to check if contents have updated), and I would want drake to know if I've changed my function.

@wlandau
Copy link
Member Author

wlandau commented Jul 23, 2018

I think I've got you covered, Alex. The way development is going, if you have a target x with trigger(condition = f()), drake will make sure f() is checked and processed before building x. By default, it will also rebuild x whenever f() changes because drake does not distinguish between the dependencies of the command and those of the trigger. It is a useful simplifying assumption that greatly reduces the complexity of the implementation, but it may not be exactly what you want. To decrease sensitivity, there are times when you might want trigger(condition = f(), depends = FALSE).

@wlandau
Copy link
Member Author

wlandau commented Jul 23, 2018

Implemented via #478. The tests are very thorough, and it's on the master branch now. Let's keep iterating. I do not plan to submit to CRAN again until around August 10.

@wlandau
Copy link
Member Author

wlandau commented Jul 24, 2018

@AlexAxthelm, please take a look at #480. I think the dependencies of triggers should be checked and processed beforehand, but I do not think changes to trigger-only dependencies should necessarily cause commands to rerun. Does the interface still do what you want?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants