A complete tidyselect integration for the `columns` argument in validation functions #493

yjunechoe · 2023-10-17T20:42:54Z

Hi Rich -

Making my way over from your R/Pharma pointblank workshop yesterday - thanks for the great package! I've actually been accumulating some thoughts about a complete tidyselect integration for pointblank, and the workshop nudged me to wrap up my thoughts into a PR.

Intro

My understanding of the column selection logic as it currently exists is that the columns argument of user-facing validations functions is always enquo()-ed, and forwarded to (at least) three different implementations:

If input is a vars() call, deparse the elements and return the column names in a character vector.
If input is a call to one of the supported {tidyselect} functions in exported_tidyselect_fns, forward the quosure to eval_select().
For all other cases, the input (sometimes character vector, sometimes a symbol) is forwarded to vars_select() as quosure and defused there.

This PR tries to unify these various behaviors under tidyselect::eval_select() (mostly by rewriting the resolve_columns() internal function). The proposal is to make all user-supplied expressions to columns pass through {tidyselect}, as much as possible.

Problems

There are (or were, if I did this right) four points of friction for a complete {tidyselect}-backend refactoring:

Both users and developers of pointblank can freely switch between passing expressions vs. character vectors to the column argument without being explicit about which one we're intending. The PRO is that the function is smart and it "just works", which is convenient. The CON is that it requires/causes the kind of fractured implementation of the selection logic that I described above. The solution is to expect the argument to always be captured/quoted by default, unless it's specifically marked for ordinary evaluation. dplyr::select() has already paved way for this by introducing all_of(). So anytime in the source code we do something like columns = columns and expect the columns argument to be ordinarily evaluated as a character vector, this PR ensures that the input is wrapped in all_of() to let eval_select() handle it:
The way we currently implement vars() will become somewhat of an off-label usage in a {tidyselect} context. There's no way to make this straightforwardly work, so I just special case this for backwards compatibility. Luckily, {tidyselect} already offers this capability with c(), so the PR simply intercepts the vars() expression and replaces vars with c before sending it off to eval_select().
eval_select() is strict about resolving column selections - it throws an error if the selection is invalid. In our case, this is a doubled edged sword (see also the error in tests at the bottom of this PR). On the one hand, we get the really informative tidyverse-style error messages it generates for free, instead of the generic message that pointblank currently throws. On the other hand, the dataframe that resolve_columns() receives is sometimes empty, with no columns (ex: in the case of serially()). In the PR, I decided to just special case empty data - the vars() expression in that case is simply deparsed into a character vector, bypassing tidyselect entirely (leaving it up to pointblank to catch any column selection errors downstream).
Some validation functions (ex: col_exists()) simply deparse the user-supplied column expression immediately without doing anything more with it. These should get re-routed to resolve_columns() for consistency (of having all user-facing column selection logic be powered by {tidyselect}).

The first three actually turned out to be pretty trivial to address, as far as refactoring is concerned - It took a re-write of the resolve_columns() function + wrapping the input to columns in all_of() whenever we're passing a character vector, forcing the source code to be intentional about it. The fourth friction point requires editing the body of individual validation functions, so I'm holding them off for now.

Example

To demonstrate what the PR enables, here it is in action for rows_distinct(), which seemed like the prime candidate for this refactoring:

tbl <- data.frame(x = 1:2, y = "A")
expect_rows_distinct(tbl, c(x, y))
expect_rows_distinct(tbl, c(1, 2))
expect_rows_distinct(tbl, !tidyselect::where(is.character))
expect_rows_distinct(tbl, tidyselect::all_of(c("x", "y")))
expect_rows_distinct(tbl, c(x, tidyselect::any_of(c("y", "z"))))

One welcome consequence of this is that the column information is no longer NA for the rows_distinct() validation step:

agent <- small_table %>%
  create_agent() %>%
  rows_distinct(c(a, d:f)) %>%
  interrogate()
unlist(agent$validation_set$column)
#> [1] "a, d, e, f"

A new test file test-tidyselect_integration.R has a more comprehensive coverage of {tidyselect} features that we expect to work.

Remaining considerations

This is a draft PR because of these outstanding issues:

Deprecation

We might want to consider signalling deprecation for 1) vars() (use c() instead), and 2) passing character vectors expecting it to "just work" (wrap it in all_of() instead). Users have already built up intuition about this behavior from using dplyr, so it hopefully shouldn't be too difficult to grok. I haven't implemented a vars() deprecation warning in this PR yet (I don't know how disruptive that'd be) but in the case of all_of() the PR just lets the deprecation warning trickle up from tidyselect.

Coverage and WIP status

This PR needs to finish integrating the new tidyselect implementation for some of the validation functions that still use the old implementation (ex: rows_complete()) and also for those that just don't do tidyselect entirely (ex: col_exists())

Tests

outdated

When I devtools::test() (Windows), I get:

[ FAIL 2 | WARN 16 | SKIP 7 | PASS 5737 ]

Compared to main, this PR introduces 1 new FAIL and 10 new WARNs:

The warnings are mostly deprecation errors from using or missing all_of() in the appropriate places. I'm in the process of hunting the rest of these down.
The one new error may be a breaking change - the PR changes pointblank's behavior on this test:

pointblank/tests/testthat/test-get_agent_report.R

Lines 84 to 92 in dc1b917

    
           # The following agent will perform an interrogation that results 
        
           # in all test units passing in the second validation step, but 
        
           # the first experiences an evaluation error (since column 
        
           # `z` doesn't exist in `small_table`) 
        
           agent <- 
        
             create_agent(tbl = small_table) %>% 
        
             col_vals_not_null(vars("z")) %>% 
        
             col_vals_gt(vars(c), 1, na_pass = TRUE) %>% 
        
             interrogate()

Whereas pointblank agents strive to delay the evaluation of the validation checks until interrogate(), passing the column inputs through eval_select() triggers immediate errors for non-existent columns. In other words, the agent fails to "build" entirely so it fails the tests for this agent. I'm sure it's possible to put the old behavior back in (of gracefully signaling the "evaluation issue requires attention" error) but just wanted to put it out there first before I do anything about it since I think this one requires some more thought.

Related GitHub Issues and PRs

I'm sure this cuts across several open issues, but to list a few that this PR would close if worked to completion:

Ref: vars() seems required for rows_*() but not other functions #416, ends_with() doesn't work with expect_col_exists() #433, Allow use of select helpers in col_exists() #221

Checklist

I understand and agree to the Code of Conduct.
I have listed any major changes in the NEWS.
I have added testthat unit tests to tests/testthat for any new functionality.

…aluation in all_of

…inct()

yjunechoe · 2023-10-19T03:51:01Z

A small update on the one test that's failing in this PR:

outdated (failing test now resolved - tidyselect errors "lazily")

I mischaracterized the issue. Regardless of whether the input is an agent or a table, the validation step will immediately resolve the column. The difference is just whether columns is resolved by tidyselect, which triggers an immediate error for non-existent columns.

So even on the main branch, if I swap vars("z") with something that gets passed to tidyselect (like the symbol z or even just "z"), then that errors:

library(pointblank)
create_agent(small_table) %>% col_vals_not_null(vars("z")) %>% is("ptblank_agent")
#> [1] TRUE
create_agent(small_table) %>% col_vals_not_null(vars(z))   %>% is("ptblank_agent")
#> [1] TRUE
create_agent(small_table) %>% col_vals_not_null(z)         %>% is("ptblank_agent")
#> Error in `resolve_expr_to_cols()`:
#> ! Can't subset columns that don't exist.
#> ✖ Column `z` doesn't exist.
create_agent(small_table) %>% col_vals_not_null("z")       %>% is("ptblank_agent")
#> Error in `resolve_expr_to_cols()`:
#> ! Can't subset columns that don't exist.
#> ✖ Column `z` doesn't exist.

So I think this raises an additional question (which would also be nice to resolve with this PR):

Is the current behavior of vars() above a feature, an off-label usage, or a bug? More generally, are there any cases in which users would benefit from delaying the validation of column selection until interrogate()?

If it is a feature, this PR can keep the exceptional behavior of vars() by just recycling the old implementation (of deparsing its contents immediately and bypassing tidyselect).
If it is off-label usage, the PR should signal some process of deprecating/dropping this behavior.
If it is a bug, the PR ensures that vars("z") behaves like z, "z" and other selection patterns, so we should remove/edit the test that it's currently failing.

Full exploration of vars() exceptionalism:

library(pointblank) # devel @main
library(testthat)
#> 
#> Attaching package: 'testthat'
#> The following object is masked from 'package:pointblank':
#> 
#>     matches

test_that("Exceptional behavior of `vars()` with non-existent columns", {
  withr::local_options(lifecycle_verbosity = "quiet")
  
  # Supported variety of column selection patterns work for an existing column:
  col_a <- "a"
  expect_no_error({
    # `vars()` syntax
    small_table %>% col_vals_not_null(vars(a))
    # Other column selection expressions
    small_table %>% col_vals_not_null(a)
    small_table %>% col_vals_not_null("a")
    small_table %>% col_vals_not_null(col_a)
  })
  
  # All patterns error when the input is a **table** and column is non-existent
  col_z <- "z"
  expect_error( small_table %>% col_vals_not_null(vars("z")) )
  expect_error( small_table %>% col_vals_not_null(z) )
  expect_error( small_table %>% col_vals_not_null("z") )
  expect_error( small_table %>% col_vals_not_null(col_z) )
  
  # When input is an **agent**, all error except `vars()` - is this a feature?
  # - `vars()` is special-cased: it's deparsed immediately and bypasses tidyselect
  small_agent <- create_agent(small_table)
  expect_error( small_agent %>% col_vals_not_null(z) )
  expect_error( small_agent %>% col_vals_not_null("z") )
  expect_error( small_agent %>% col_vals_not_null(col_z) )
  expect_no_error( small_agent %>% col_vals_not_null(vars("z")) )
  
  # The current test treats this behavior of `vars()` as a feature and expects:
  # 1) Interrogation of the agent with the problematic validation runs
  # 2) Upon interrogation, it captures the error and sends a message about it gracefully
  expect_no_error(
    x <- small_agent %>% col_vals_not_null(vars("z")) %>% interrogate()
  )
  expect_true(x$validation_set$eval_error)
  
})
#> Test passed 🎉

yjunechoe · 2023-10-19T15:06:47Z

I just finished hunting down tidyselect-related deprecation warnings and the PR now stands at:

[ FAIL 2 | WARN 4 | SKIP 7 | PASS 5740 ]

Which still includes the 1 error from test-get_agent_report.R that I described above. The others error/warnings are unrelated to the PR. details below:

Rest of test warn/fails

An error from test-scan_data.R thrown by scan_data(), which I also get on the main branch. This may just be something about my setup, so I haven't touched this:

Error in purrr::map_chr(order, as_name): ℹ In index: 1.
Caused by error in as_string():
! Can't convert a call to a string.

The remaining 4 warnings are from test-yaml.R (and also present on the main branch), and it's about another deprecated behavior. I have an idea for resolving this but I can move that to a separate PR:

Quosure lists can't be concatenated with objects other than quosures as of rlang
0.3.0. Please call as.list() on the quosure list first.

I think the PR is ready for course-correcting now. Thanks in advance for considering!

rich-iannone · 2023-10-19T16:36:31Z

Thanks so much for all the careful work you've put into this! The story with vars() is that I wanted (in the values arg only) to specially emphasize that the value tested against relates to dynamic values (and not a single static value) in a column. At first I thought the key distinction could be: (1) numbers for values and (2) strings for column names, however, a valid value in value could be a static string. At least in some backends you could have number-string comparisons and string-string comparisons. That has sort of led to this coding confusion.

If we can gracefully move away from that, have laziness like we currently do (i.e., failures happen in the report, not during development of the validation plan), and not break current workflows using vars() but slowly deprecate, this would be an incredibly awesome development.

I'll read over your notes/changes a bit more carefully in the coming days (just so happens that I'm busier than usual right now) but for now I just wanted to say this is headed in a great direction and, again, I appreciate the work you've been doing here!

yjunechoe · 2023-10-20T02:47:46Z

Awesome - thanks for being receptive to this! And I appreciate the additional context as well - they're very helpful to know at this point of the PR.

I'll let you take the time to have a closer look at the changes. In the meantime, if I understand correctly, the next steps should be:

1) Make sure that column selection failures are only signaled at `interrogate()` (when validations are executed against the dataframe), and not while building up the validation steps in an agent.

I just took a stab at this by splitting resolve_columns() into two parts: an inner function resolve_columns_internal() implementing tidyselect and an outer function resolve_columns() that implements try-catch. Tidyselect errors are always caught and only re-thrown when outside of validation-building contexts. Otherwise, the columns that tidyselect attempted to select will be returned, only to get caught again when the agent is eventually interrogate()-ed.

The tricky part of this is ensuring that tidyselect do throw errors in test_*() and expect_*(): those involve agents under the hood too, so they're very similar to the kind of code that users write. One way in which they differ is that they're "secret ("::QUIET::") agents" - I don't know how brittle this is but that's currently the logic I'm using to disambiguate between the two cases, where expectations/tests should throw immediate errors vs. in the "normal" cases for users, where we instead want errors to be signaled gracefully via a 💣 in the report .

Other than that sticking point, I think this implementation works pretty well. I've added another test file test-tidyselect_fails_safely.R covering additional behaviors we expect from column selection errors.

2) Figure out what to do with `value` and `value = var(col)`

I think this could be handled separately from this PR (which could then just focus on columns). I've moved my thoughts on the value argument of column comparison functions to #494

rich-iannone · 2023-10-21T18:39:12Z

The work you’re doing here is quite significant so please add yourself as an author (but only if you want to be a co-author!).

yjunechoe · 2023-10-21T21:22:31Z

Thanks - yes, I'd be honored to!

yjunechoe · 2023-10-23T19:12:18Z

The basic groundwork for a complete tidyselect integration in the columns argument is now complete!

All user-facing validation functions now pass the columns input through resolve_columns(), enabling the full range of tidyselect expressions. To be specific, tidyselect support has been extended to rows_distinct(), rows_complete(), and col_exists(), with their peculiarities preserved.
During validation-planning, all tidyselect failures are delayed instead of erroring immediately. This preserves the old behavior of 💥-ing at report
Backwards compatibility with vars() is guaranteed, for vars(x), vars("x"), var(!!col) and even some more funky ones I've found in the wild like vars(ends_with("x")).

The remaining issues that'd be nice to resolve as a part of this PR are:

Should the use of vars() in columns signal some deprecation message/warning? (when, how frequent, how specific, etc.)
Are there any changes to the documentation that are in order? I assume we'd want to at least modify this part of the "Column Names" section:

Aside from column names in quotes and in vars(), tidyselect helper functions are available for specifying columns. They are: starts_with(), ends_with(), contains(), matches(), and everything().

I have a few other smaller issues that I'm keeping track of in my mind but they could probably be tackled as separate PRs, so I'll refrain from complicating this PR any further than this.

rich-iannone · 2023-10-23T19:32:01Z

The basic groundwork for a complete tidyselect integration in the columns argument is now complete!

All user-facing validation functions now pass the columns input through resolve_columns(), enabling the full range of tidyselect expressions. To be specific, tidyselect support has been extended to rows_distinct(), rows_complete(), and col_exists(), with their peculiarities preserved.

During validation-planning, all tidyselect failures are delayed instead of erroring immediately. This preserves the old behavior of 💥-ing at report

Backwards compatibility with vars() is guaranteed, for vars(x), vars("x"), var(!!col) and even some more funky ones I've found in the wild like vars(ends_with("x")).

This is really excellent work! It's good that there's backwards compatibility with vars() because I imagine there's a fair amount of deployed code that uses it. And we even write expressions with vars() during output to YAML (see

pointblank/R/yaml_write.R

Line 700 in dc1b917

as_vars_fn <- function(columns) {

) so we need to be able to interpret that for YAML round-tripping.

The remaining issues that'd be nice to resolve as a part of this PR are:

Should the use of vars() in columns signal some deprecation message/warning? (when, how frequent, how specific, etc.)

I think we should instead remove references to vars() in documentation and examples. By having it still work in an un-advertised way means that users will slowly migrate over to vars()-less input. I think after a release or two, we could then signal deprecation.

Are there any changes to the documentation that are in order? I assume we'd want to at least modify this part of the "Column Names" section:

Aside from column names in quotes and in vars(), tidyselect helper functions are available for specifying columns. They are: starts_with(), ends_with(), contains(), matches(), and everything().

We should definitely move over all examples to the preferred form and make sure that the YAML writing is modified to remove its use of vars(). This could happen over a few PRs!

I have a few other smaller issues that I'm keeping track of in my mind but they could probably be tackled as separate PRs, so I'll refrain from complicating this PR any further than this.

Thanks for thinking further ahead on this. Feel free to create issues or just go for it with new PRs!

I'll do a bit of manual testing with this PR, just to see if there are any new issues. Again, thank you for all the work!

yjunechoe · 2023-10-24T01:43:54Z

Gotcha - I'll let you finish your side of tests before I do anything more here (though I think I've put in everything I wanted - happy to move residual issues into their own PRs once this big one is merged).

Please let me know if any part of the code is unclear!

rich-iannone

Excellent work!

Documentation and general cleanup for #493, #495

yjunechoe added 18 commits October 17, 2023 09:13

streamlining internal tidyselect first draft

b09ded9

[temp] make error msg regex test generally for column-related error

5cedd20

keep original deparse handling of vars

2d939aa

be intentional about passing character vector of columns

3f7cce4

remove expr constraint to bring the environment along for ordinary ev…

1f04ebc

…aluation in all_of

work with vars() as expr not quosure since we don't care about the env

85217be

don't use all_of() for ordinarily evaluated column argument

9fcbaed

more explicit code for resolving tbl

e72e49b

exception for when tbl is empty like in specially()

ac65f40

use tidyselect in rows_distinct

d386432

tidyselect tests first draft

1382a08

namespace tidyselect fns in test

099ba5f

revert all_of for create_validation_step, which uses ordinary evaluation

7e55558

use all_of

14f9878

implement behavior of NULL with everything()

4cc1c84

relax regexp for column not found error

91d3102

tests expect column info from row_distinct() validation step

d7c1eff

add test for tidyselect where support

7b3ee39

yjunechoe marked this pull request as draft October 17, 2023 20:45

yjunechoe added 4 commits October 17, 2023 16:52

clean up resolve_columns

9bc5524

use quo_is_null

982d4c5

use quo_is_call

719cf6f

simplify defaulting to everything() when column is NULL for rows_dist…

5bbe101

…inct()

yjunechoe added 2 commits October 19, 2023 10:24

[chore] probe_columns helpers marked for always receiving chr input

646ec8e

[chore] some more all_of coverage

1978e8d

yjunechoe added 2 commits October 19, 2023 14:57

linter

65ae1ad

tidyselect errors only in non-validation-building contexts

a50643d

yjunechoe added 4 commits October 19, 2023 19:50

Resolve to NA for when NULL passed to columns argument

dddfd1e

force tbl/agent in resolve_columns to avoid interrupt promise warnings

af394ab

tests for tidyselect column selection error behaviors

7111ed2

test status in validation set directly

726aed6

yjunechoe mentioned this pull request Oct 20, 2023

A possible switch to tidyeval for values and left/right arguments #494

Closed

2 tasks

[chore] replace .env$ with all_of()

7e492bd

yjunechoe added 10 commits October 21, 2023 18:01

tidyselect columns in rows_complete

bbacbc7

test 0-column selection behavior for rows_* validation functions

56fb16f

add june as aut

4f07e53

use all_of

0585392

clean up tidyselect test file

f07e396

tests expect column info from row_complete() validation step

7b8c66c

tidyselect for col_exists(); resolves rstudio#433

2255b41

test tidyselect for col_exists()

be5725e

more tests for vars() backwards compatibility

8ec5486

col_exists() only fails and never errors for missing column selection

3461c33

yjunechoe marked this pull request as ready for review October 23, 2023 19:19

yjunechoe changed the title ~~A complete tidyselect integration for user-facing validation functions~~ A complete tidyselect integration for the columns argument in validation functions Oct 23, 2023

yjunechoe mentioned this pull request Oct 25, 2023

Enable glue syntax in label to access current column/segment (and possibly others) #495

Merged

3 tasks

rich-iannone approved these changes Oct 28, 2023

View reviewed changes

rich-iannone merged commit ca26d12 into rstudio:main Oct 28, 2023
12 of 13 checks passed

yjunechoe deleted the tidyselect-coverage branch October 28, 2023 18:16

yjunechoe mentioned this pull request Oct 29, 2023

Documentation and general cleanup for #493, #495 #496

Merged

27 tasks

yjunechoe added a commit that referenced this pull request Oct 31, 2023

Merge pull request #496 from yjunechoe/tidyselect-coverage-cleanup

ed6f948

Documentation and general cleanup for #493, #495

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A complete tidyselect integration for the `columns` argument in validation functions #493

A complete tidyselect integration for the `columns` argument in validation functions #493

yjunechoe commented Oct 17, 2023 •

edited

yjunechoe commented Oct 19, 2023 •

edited

yjunechoe commented Oct 19, 2023 •

edited

rich-iannone commented Oct 19, 2023 •

edited

yjunechoe commented Oct 20, 2023 •

edited

rich-iannone commented Oct 21, 2023

yjunechoe commented Oct 21, 2023

yjunechoe commented Oct 23, 2023

rich-iannone commented Oct 23, 2023

yjunechoe commented Oct 24, 2023

rich-iannone left a comment

	# The following agent will perform an interrogation that results
	# in all test units passing in the second validation step, but
	# the first experiences an evaluation error (since column
	# `z` doesn't exist in `small_table`)
	agent <-
	create_agent(tbl = small_table) %>%
	col_vals_not_null(vars("z")) %>%
	col_vals_gt(vars(c), 1, na_pass = TRUE) %>%
	interrogate()

A complete tidyselect integration for the columns argument in validation functions #493

A complete tidyselect integration for the columns argument in validation functions #493

Conversation

yjunechoe commented Oct 17, 2023 • edited

Intro

Problems

Example

Remaining considerations

Deprecation

Coverage and WIP status

Tests

Related GitHub Issues and PRs

Checklist

yjunechoe commented Oct 19, 2023 • edited

yjunechoe commented Oct 19, 2023 • edited

rich-iannone commented Oct 19, 2023 • edited

yjunechoe commented Oct 20, 2023 • edited

1) Make sure that column selection failures are only signaled at interrogate() (when validations are executed against the dataframe), and not while building up the validation steps in an agent.

2) Figure out what to do with value and value = var(col)

rich-iannone commented Oct 21, 2023

yjunechoe commented Oct 21, 2023

yjunechoe commented Oct 23, 2023

rich-iannone commented Oct 23, 2023

yjunechoe commented Oct 24, 2023

rich-iannone left a comment

Choose a reason for hiding this comment

A complete tidyselect integration for the `columns` argument in validation functions #493

A complete tidyselect integration for the `columns` argument in validation functions #493

yjunechoe commented Oct 17, 2023 •

edited

yjunechoe commented Oct 19, 2023 •

edited

yjunechoe commented Oct 19, 2023 •

edited

rich-iannone commented Oct 19, 2023 •

edited

yjunechoe commented Oct 20, 2023 •

edited

1) Make sure that column selection failures are only signaled at `interrogate()` (when validations are executed against the dataframe), and not while building up the validation steps in an agent.

2) Figure out what to do with `value` and `value = var(col)`