Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A complete tidyselect integration for the columns argument in validation functions #493

Merged
merged 42 commits into from Oct 28, 2023

Conversation

yjunechoe
Copy link
Collaborator

@yjunechoe yjunechoe commented Oct 17, 2023

Hi Rich -

Making my way over from your R/Pharma pointblank workshop yesterday - thanks for the great package! I've actually been accumulating some thoughts about a complete tidyselect integration for pointblank, and the workshop nudged me to wrap up my thoughts into a PR.

Intro

My understanding of the column selection logic as it currently exists is that the columns argument of user-facing validations functions is always enquo()-ed, and forwarded to (at least) three different implementations:

  1. If input is a vars() call, deparse the elements and return the column names in a character vector.

  2. If input is a call to one of the supported {tidyselect} functions in exported_tidyselect_fns, forward the quosure to eval_select().

  3. For all other cases, the input (sometimes character vector, sometimes a symbol) is forwarded to vars_select() as quosure and defused there.

This PR tries to unify these various behaviors under tidyselect::eval_select() (mostly by rewriting the resolve_columns() internal function). The proposal is to make all user-supplied expressions to columns pass through {tidyselect}, as much as possible.

Problems

There are (or were, if I did this right) four points of friction for a complete {tidyselect}-backend refactoring:

  1. Both users and developers of pointblank can freely switch between passing expressions vs. character vectors to the column argument without being explicit about which one we're intending. The PRO is that the function is smart and it "just works", which is convenient. The CON is that it requires/causes the kind of fractured implementation of the selection logic that I described above. The solution is to expect the argument to always be captured/quoted by default, unless it's specifically marked for ordinary evaluation. dplyr::select() has already paved way for this by introducing all_of(). So anytime in the source code we do something like columns = columns and expect the columns argument to be ordinarily evaluated as a character vector, this PR ensures that the input is wrapped in all_of() to let eval_select() handle it:

  2. The way we currently implement vars() will become somewhat of an off-label usage in a {tidyselect} context. There's no way to make this straightforwardly work, so I just special case this for backwards compatibility. Luckily, {tidyselect} already offers this capability with c(), so the PR simply intercepts the vars() expression and replaces vars with c before sending it off to eval_select().

  3. eval_select() is strict about resolving column selections - it throws an error if the selection is invalid. In our case, this is a doubled edged sword (see also the error in tests at the bottom of this PR). On the one hand, we get the really informative tidyverse-style error messages it generates for free, instead of the generic message that pointblank currently throws. On the other hand, the dataframe that resolve_columns() receives is sometimes empty, with no columns (ex: in the case of serially()). In the PR, I decided to just special case empty data - the vars() expression in that case is simply deparsed into a character vector, bypassing tidyselect entirely (leaving it up to pointblank to catch any column selection errors downstream).

  4. Some validation functions (ex: col_exists()) simply deparse the user-supplied column expression immediately without doing anything more with it. These should get re-routed to resolve_columns() for consistency (of having all user-facing column selection logic be powered by {tidyselect}).

The first three actually turned out to be pretty trivial to address, as far as refactoring is concerned - It took a re-write of the resolve_columns() function + wrapping the input to columns in all_of() whenever we're passing a character vector, forcing the source code to be intentional about it. The fourth friction point requires editing the body of individual validation functions, so I'm holding them off for now.

Example

To demonstrate what the PR enables, here it is in action for rows_distinct(), which seemed like the prime candidate for this refactoring:

tbl <- data.frame(x = 1:2, y = "A")
expect_rows_distinct(tbl, c(x, y))
expect_rows_distinct(tbl, c(1, 2))
expect_rows_distinct(tbl, !tidyselect::where(is.character))
expect_rows_distinct(tbl, tidyselect::all_of(c("x", "y")))
expect_rows_distinct(tbl, c(x, tidyselect::any_of(c("y", "z"))))

One welcome consequence of this is that the column information is no longer NA for the rows_distinct() validation step:

agent <- small_table %>%
  create_agent() %>%
  rows_distinct(c(a, d:f)) %>%
  interrogate()
unlist(agent$validation_set$column)
#> [1] "a, d, e, f"

A new test file test-tidyselect_integration.R has a more comprehensive coverage of {tidyselect} features that we expect to work.

Remaining considerations

This is a draft PR because of these outstanding issues:

Deprecation

We might want to consider signalling deprecation for 1) vars() (use c() instead), and 2) passing character vectors expecting it to "just work" (wrap it in all_of() instead). Users have already built up intuition about this behavior from using dplyr, so it hopefully shouldn't be too difficult to grok. I haven't implemented a vars() deprecation warning in this PR yet (I don't know how disruptive that'd be) but in the case of all_of() the PR just lets the deprecation warning trickle up from tidyselect.

Coverage and WIP status

This PR needs to finish integrating the new tidyselect implementation for some of the validation functions that still use the old implementation (ex: rows_complete()) and also for those that just don't do tidyselect entirely (ex: col_exists())

Tests

outdated

When I devtools::test() (Windows), I get:

[ FAIL 2 | WARN 16 | SKIP 7 | PASS 5737 ]

Compared to main, this PR introduces 1 new FAIL and 10 new WARNs:

  • The warnings are mostly deprecation errors from using or missing all_of() in the appropriate places. I'm in the process of hunting the rest of these down.

  • The one new error may be a breaking change - the PR changes pointblank's behavior on this test:

# The following agent will perform an interrogation that results
# in all test units passing in the second validation step, but
# the first experiences an evaluation error (since column
# `z` doesn't exist in `small_table`)
agent <-
create_agent(tbl = small_table) %>%
col_vals_not_null(vars("z")) %>%
col_vals_gt(vars(c), 1, na_pass = TRUE) %>%
interrogate()

Whereas pointblank agents strive to delay the evaluation of the validation checks until interrogate(), passing the column inputs through eval_select() triggers immediate errors for non-existent columns. In other words, the agent fails to "build" entirely so it fails the tests for this agent. I'm sure it's possible to put the old behavior back in (of gracefully signaling the "evaluation issue requires attention" error) but just wanted to put it out there first before I do anything about it since I think this one requires some more thought.

Related GitHub Issues and PRs

I'm sure this cuts across several open issues, but to list a few that this PR would close if worked to completion:

Checklist

@yjunechoe yjunechoe marked this pull request as draft October 17, 2023 20:45
@yjunechoe
Copy link
Collaborator Author

yjunechoe commented Oct 19, 2023

A small update on the one test that's failing in this PR:

outdated (failing test now resolved - tidyselect errors "lazily")

I mischaracterized the issue. Regardless of whether the input is an agent or a table, the validation step will immediately resolve the column. The difference is just whether columns is resolved by tidyselect, which triggers an immediate error for non-existent columns.

So even on the main branch, if I swap vars("z") with something that gets passed to tidyselect (like the symbol z or even just "z"), then that errors:

library(pointblank)
create_agent(small_table) %>% col_vals_not_null(vars("z")) %>% is("ptblank_agent")
#> [1] TRUE
create_agent(small_table) %>% col_vals_not_null(vars(z))   %>% is("ptblank_agent")
#> [1] TRUE
create_agent(small_table) %>% col_vals_not_null(z)         %>% is("ptblank_agent")
#> Error in `resolve_expr_to_cols()`:
#> ! Can't subset columns that don't exist.
#> ✖ Column `z` doesn't exist.
create_agent(small_table) %>% col_vals_not_null("z")       %>% is("ptblank_agent")
#> Error in `resolve_expr_to_cols()`:
#> ! Can't subset columns that don't exist.
#> ✖ Column `z` doesn't exist.

So I think this raises an additional question (which would also be nice to resolve with this PR):

Is the current behavior of vars() above a feature, an off-label usage, or a bug? More generally, are there any cases in which users would benefit from delaying the validation of column selection until interrogate()?

  • If it is a feature, this PR can keep the exceptional behavior of vars() by just recycling the old implementation (of deparsing its contents immediately and bypassing tidyselect).
  • If it is off-label usage, the PR should signal some process of deprecating/dropping this behavior.
  • If it is a bug, the PR ensures that vars("z") behaves like z, "z" and other selection patterns, so we should remove/edit the test that it's currently failing.

Full exploration of vars() exceptionalism:

library(pointblank) # devel @main
library(testthat)
#> 
#> Attaching package: 'testthat'
#> The following object is masked from 'package:pointblank':
#> 
#>     matches

test_that("Exceptional behavior of `vars()` with non-existent columns", {
  withr::local_options(lifecycle_verbosity = "quiet")
  
  # Supported variety of column selection patterns work for an existing column:
  col_a <- "a"
  expect_no_error({
    # `vars()` syntax
    small_table %>% col_vals_not_null(vars(a))
    # Other column selection expressions
    small_table %>% col_vals_not_null(a)
    small_table %>% col_vals_not_null("a")
    small_table %>% col_vals_not_null(col_a)
  })
  
  # All patterns error when the input is a **table** and column is non-existent
  col_z <- "z"
  expect_error( small_table %>% col_vals_not_null(vars("z")) )
  expect_error( small_table %>% col_vals_not_null(z) )
  expect_error( small_table %>% col_vals_not_null("z") )
  expect_error( small_table %>% col_vals_not_null(col_z) )
  
  # When input is an **agent**, all error except `vars()` - is this a feature?
  # - `vars()` is special-cased: it's deparsed immediately and bypasses tidyselect
  small_agent <- create_agent(small_table)
  expect_error( small_agent %>% col_vals_not_null(z) )
  expect_error( small_agent %>% col_vals_not_null("z") )
  expect_error( small_agent %>% col_vals_not_null(col_z) )
  expect_no_error( small_agent %>% col_vals_not_null(vars("z")) )
  
  # The current test treats this behavior of `vars()` as a feature and expects:
  # 1) Interrogation of the agent with the problematic validation runs
  # 2) Upon interrogation, it captures the error and sends a message about it gracefully
  expect_no_error(
    x <- small_agent %>% col_vals_not_null(vars("z")) %>% interrogate()
  )
  expect_true(x$validation_set$eval_error)
  
})
#> Test passed 🎉

@yjunechoe
Copy link
Collaborator Author

yjunechoe commented Oct 19, 2023

I just finished hunting down tidyselect-related deprecation warnings and the PR now stands at:

[ FAIL 2 | WARN 4 | SKIP 7 | PASS 5740 ]

Which still includes the 1 error from test-get_agent_report.R that I described above. The others error/warnings are unrelated to the PR. details below:

Rest of test warn/fails
  • An error from test-scan_data.R thrown by scan_data(), which I also get on the main branch. This may just be something about my setup, so I haven't touched this:

Error in purrr::map_chr(order, as_name): ℹ In index: 1.
Caused by error in as_string():
! Can't convert a call to a string.

  • The remaining 4 warnings are from test-yaml.R (and also present on the main branch), and it's about another deprecated behavior. I have an idea for resolving this but I can move that to a separate PR:

Quosure lists can't be concatenated with objects other than quosures as of rlang
0.3.0. Please call as.list() on the quosure list first.


I think the PR is ready for course-correcting now. Thanks in advance for considering!

@rich-iannone
Copy link
Member

rich-iannone commented Oct 19, 2023

Thanks so much for all the careful work you've put into this! The story with vars() is that I wanted (in the values arg only) to specially emphasize that the value tested against relates to dynamic values (and not a single static value) in a column. At first I thought the key distinction could be: (1) numbers for values and (2) strings for column names, however, a valid value in value could be a static string. At least in some backends you could have number-string comparisons and string-string comparisons. That has sort of led to this coding confusion.

If we can gracefully move away from that, have laziness like we currently do (i.e., failures happen in the report, not during development of the validation plan), and not break current workflows using vars() but slowly deprecate, this would be an incredibly awesome development.

I'll read over your notes/changes a bit more carefully in the coming days (just so happens that I'm busier than usual right now) but for now I just wanted to say this is headed in a great direction and, again, I appreciate the work you've been doing here!

@yjunechoe
Copy link
Collaborator Author

yjunechoe commented Oct 20, 2023

Awesome - thanks for being receptive to this! And I appreciate the additional context as well - they're very helpful to know at this point of the PR.

I'll let you take the time to have a closer look at the changes. In the meantime, if I understand correctly, the next steps should be:

1) Make sure that column selection failures are only signaled at interrogate() (when validations are executed against the dataframe), and not while building up the validation steps in an agent.

I just took a stab at this by splitting resolve_columns() into two parts: an inner function resolve_columns_internal() implementing tidyselect and an outer function resolve_columns() that implements try-catch. Tidyselect errors are always caught and only re-thrown when outside of validation-building contexts. Otherwise, the columns that tidyselect attempted to select will be returned, only to get caught again when the agent is eventually interrogate()-ed.

The tricky part of this is ensuring that tidyselect do throw errors in test_*() and expect_*(): those involve agents under the hood too, so they're very similar to the kind of code that users write. One way in which they differ is that they're "secret ("::QUIET::") agents" - I don't know how brittle this is but that's currently the logic I'm using to disambiguate between the two cases, where expectations/tests should throw immediate errors vs. in the "normal" cases for users, where we instead want errors to be signaled gracefully via a 💣 in the report .

Other than that sticking point, I think this implementation works pretty well. I've added another test file test-tidyselect_fails_safely.R covering additional behaviors we expect from column selection errors.

2) Figure out what to do with value and value = var(col)

I think this could be handled separately from this PR (which could then just focus on columns). I've moved my thoughts on the value argument of column comparison functions to #494

@rich-iannone
Copy link
Member

The work you’re doing here is quite significant so please add yourself as an author (but only if you want to be a co-author!).

@yjunechoe
Copy link
Collaborator Author

Thanks - yes, I'd be honored to!

@yjunechoe
Copy link
Collaborator Author

The basic groundwork for a complete tidyselect integration in the columns argument is now complete!

  • All user-facing validation functions now pass the columns input through resolve_columns(), enabling the full range of tidyselect expressions. To be specific, tidyselect support has been extended to rows_distinct(), rows_complete(), and col_exists(), with their peculiarities preserved.
  • During validation-planning, all tidyselect failures are delayed instead of erroring immediately. This preserves the old behavior of 💥-ing at report
  • Backwards compatibility with vars() is guaranteed, for vars(x), vars("x"), var(!!col) and even some more funky ones I've found in the wild like vars(ends_with("x")).

The remaining issues that'd be nice to resolve as a part of this PR are:

  1. Should the use of vars() in columns signal some deprecation message/warning? (when, how frequent, how specific, etc.)

  2. Are there any changes to the documentation that are in order? I assume we'd want to at least modify this part of the "Column Names" section:

    Aside from column names in quotes and in vars(), tidyselect helper functions are available for specifying columns. They are: starts_with(), ends_with(), contains(), matches(), and everything().

I have a few other smaller issues that I'm keeping track of in my mind but they could probably be tackled as separate PRs, so I'll refrain from complicating this PR any further than this.

@yjunechoe yjunechoe marked this pull request as ready for review October 23, 2023 19:19
@yjunechoe yjunechoe changed the title A complete tidyselect integration for user-facing validation functions A complete tidyselect integration for the columns argument in validation functions Oct 23, 2023
@rich-iannone
Copy link
Member

The basic groundwork for a complete tidyselect integration in the columns argument is now complete!

  • All user-facing validation functions now pass the columns input through resolve_columns(), enabling the full range of tidyselect expressions. To be specific, tidyselect support has been extended to rows_distinct(), rows_complete(), and col_exists(), with their peculiarities preserved.
  • During validation-planning, all tidyselect failures are delayed instead of erroring immediately. This preserves the old behavior of 💥-ing at report
  • Backwards compatibility with vars() is guaranteed, for vars(x), vars("x"), var(!!col) and even some more funky ones I've found in the wild like vars(ends_with("x")).

This is really excellent work! It's good that there's backwards compatibility with vars() because I imagine there's a fair amount of deployed code that uses it. And we even write expressions with vars() during output to YAML (see

as_vars_fn <- function(columns) {
) so we need to be able to interpret that for YAML round-tripping.

The remaining issues that'd be nice to resolve as a part of this PR are:

  1. Should the use of vars() in columns signal some deprecation message/warning? (when, how frequent, how specific, etc.)

I think we should instead remove references to vars() in documentation and examples. By having it still work in an un-advertised way means that users will slowly migrate over to vars()-less input. I think after a release or two, we could then signal deprecation.

  1. Are there any changes to the documentation that are in order? I assume we'd want to at least modify this part of the "Column Names" section:

    Aside from column names in quotes and in vars(), tidyselect helper functions are available for specifying columns. They are: starts_with(), ends_with(), contains(), matches(), and everything().

We should definitely move over all examples to the preferred form and make sure that the YAML writing is modified to remove its use of vars(). This could happen over a few PRs!

I have a few other smaller issues that I'm keeping track of in my mind but they could probably be tackled as separate PRs, so I'll refrain from complicating this PR any further than this.

Thanks for thinking further ahead on this. Feel free to create issues or just go for it with new PRs!

I'll do a bit of manual testing with this PR, just to see if there are any new issues. Again, thank you for all the work!

@yjunechoe
Copy link
Collaborator Author

Gotcha - I'll let you finish your side of tests before I do anything more here (though I think I've put in everything I wanted - happy to move residual issues into their own PRs once this big one is merged).

Please let me know if any part of the code is unclear!

Copy link
Member

@rich-iannone rich-iannone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work!

@rich-iannone rich-iannone merged commit ca26d12 into rstudio:main Oct 28, 2023
12 of 13 checks passed
@yjunechoe yjunechoe deleted the tidyselect-coverage branch October 28, 2023 18:16
yjunechoe added a commit that referenced this pull request Oct 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants