Bad column selections fail gracefully at interrogation #499

yjunechoe · 2023-10-31T19:34:50Z

Following up on #497, this PR ensures that:

A NULL or missing value for columns causes the step to 💥, unless a default is present (e.g., everything() in the rows_*() functions)
Accidentally tidyselecting 0 columns, as in small_table %>% ... starts_with("z"), also causes the step to 💥
Intentionally tidyselecting specific, non-existent columns, as in small_table %>% ... all_of("z"), causes the step to 💥 and also shows the attempted column z in the report

Luckily, this required minimal changes to the code. The lifecycle of 0-column selection 💥 is now the following:

resolve_columns() returns NA for 0-column selections, both in the case of columns=NULL/<empty> and tidyselecting 0-columns. From this point onwards, the two are treated as the same (we can always later recover how 0-column was selected by looking at $validation_set$columns_expr).
create_validation_step() writes NA for the $column column of the validation step
At interrogate(), the NA is read in as-is at each step, and passed down to individual interrogate_*() functions. Previously, it would use this info to skip certain steps (I mistook this for a feature that should be applied to all failure-to-select cases, hence introducing the skipping bug!)
The get_column_as_sym_at_idx() internal called by the interrogate_*() functions allows NA to pass through as-is. Previously, it'd turn NA into "NA"
The NA is caught inside column_validity_has_columns() called at the top of each interrogate_*() function, and the error thrown from there is signaled to the report with a "... yielded no columns" evaluation failure.
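
As a rough illustration of the lifecycle above, here is a minimal base R sketch. Every function name in it is hypothetical and greatly simplified; it is not pointblank's internal code. It only shows the idea of using NA as a "no columns" marker that travels untouched until a check at interrogation time turns it into a recorded evaluation failure rather than a raised error.

```r
# Hypothetical, simplified stand-ins for the lifecycle described above.
# None of these are pointblank's real internals.

# A selection that is NULL or matches nothing is recorded as NA
resolve_columns_sketch <- function(tbl, pattern = NULL) {
  if (is.null(pattern)) return(NA_character_)
  matched <- grep(pattern, names(tbl), value = TRUE)
  if (length(matched) == 0) NA_character_ else matched
}

# The check that runs at the top of each interrogation function
check_has_columns_sketch <- function(column) {
  if (length(column) == 1 && is.na(column)) {
    stop("The column selection yielded no columns", call. = FALSE)
  }
  invisible(TRUE)
}

# NA is passed down as-is; the error is captured for the report, not raised
interrogate_step_sketch <- function(tbl, column) {
  tryCatch(
    {
      check_has_columns_sketch(column)
      list(column = column, eval_error = FALSE, message = NA_character_)
    },
    error = function(e) {
      list(column = column, eval_error = TRUE, message = conditionMessage(e))
    }
  )
}

tbl <- data.frame(a = 1:3, b = 4:6)
bad  <- interrogate_step_sketch(tbl, resolve_columns_sketch(tbl, "^z"))
good <- interrogate_step_sketch(tbl, resolve_columns_sketch(tbl, "^a"))
```

Here `bad` carries `eval_error = TRUE` together with the failure message, while `good` carries the resolved column name and no error.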

Current behavior (will write them into tests later):

devtools::load_all()
agent <- create_agent(~ small_table)

NULL/missing booms

agent %>% 
  col_vals_gt(columns = NULL, value = 5) %>% 
  interrogate()
agent %>% 
  col_vals_gt(value = 5) %>% 
  interrogate()

Selecting non-existent column booms

It also shows the column that the user attempted to select

agent %>% 
  col_vals_gt(z, 5) %>% 
  interrogate()
agent %>% 
  col_vals_gt(all_of("z"), 5) %>% 
  interrogate()

And when tidyselect returns no matches, columns is empty, like in the case of NULL:

agent %>% 
  col_vals_gt(starts_with("z"), 5) %>% 
  interrogate()

Selecting existing column succeeds

agent %>% 
  col_vals_gt(a, 5) %>% 
  interrogate()
agent %>% 
  col_vals_gt(all_of("a"), 5) %>% 
  interrogate()
agent %>% 
  col_vals_gt(starts_with("a"), 5) %>% 
  interrogate()

Selecting a mix booms selectively

agent %>% 
  col_vals_gt(c(a, z), 5) %>% 
  interrogate()
agent %>% 
  col_vals_gt(c("a", "z"), 5) %>% 
  interrogate()
multicol <- c("a", "z")
agent %>% 
  col_vals_gt(all_of(multicol), 5) %>% 
  interrogate()

(NEW) `any_of()` safely selects only the existing columns

I totally forgot about any_of() and I feel like it does some of the job we want has_columns() to do. Curious what your thoughts are on advertising this as one safe way to select columns in validation functions!

agent %>% 
  col_vals_gt(any_of(multicol), 5) %>% 
  interrogate()
agent %>% 
  col_vals_gt(any_of(c("a", "z")), 5) %>% 
  interrogate()
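
For reference, the difference between the strict and lenient helpers can be seen with tidyselect alone, outside of pointblank. The sketch below assumes the tidyselect and rlang packages are installed; `try_select()` is a helper made up for this example, and `tbl` is a stand-in data frame rather than small_table.

```r
library(tidyselect)

tbl <- data.frame(a = 1:3, b = 4:6, c = 7:9)
multicol <- c("a", "z")

# Helper for this example only: returns the selected column names,
# or the error condition if the selection fails
try_select <- function(quo) {
  tryCatch(names(eval_select(quo, tbl)), error = function(e) e)
}

strict  <- try_select(rlang::quo(all_of(multicol)))    # "z" is missing: error
lenient <- try_select(rlang::quo(any_of(multicol)))    # keeps only "a"
none    <- try_select(rlang::quo(starts_with("z")))    # zero columns, no error
```

`all_of()` fails because one requested column is absent, `any_of()` quietly drops it, and `starts_with("z")` succeeds with an empty selection, which is the case this PR turns into a 💥 at interrogation.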

yjunechoe · 2023-11-01T15:23:09Z

I'm feeling pretty good about the coverage of bad column selection behaviors so far! I borrowed an existing (now outdated) batch test setup to run various column selection scenarios for all validation functions, organized into three groups:

The col_*() group, which expands multiple columns into multiple steps, 1 column for each step.
The rows_*() group, which always returns a single step and also defaults to everything() for missing/NULL.
The special col_exists() function, which never 💥 unless columns is NULL/empty.

All column selection failures now show up as 💥 in the report, and skipping behavior has been completely decoupled from columns. The only time columns throws an immediate error is if it's a genuine evaluation error. For example, this errors early at col_vals_lt() and never hits interrogate():

agent %>%
  col_vals_lt(columns = stop("Oh no!"), value = 5) %>%
  interrogate()
#> Error in `col_vals_lt()`:
#> ! Problem while evaluating `stop("Oh no!")`.
#> Caused by error:
#> ! Oh no!
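
One way to tell the two kinds of failure apart is by condition class. The sketch below is an assumption about how this could be done, not the approach taken in this PR: it relies on tidyselect signalling a missing column with the `vctrs_error_subscript_oob` class, and `resolve_sketch()` is a hypothetical name. Only that class is handled; anything else propagates to the caller.

```r
library(tidyselect)

tbl <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# Hypothetical resolver: defer "column not found" failures, rethrow the rest
resolve_sketch <- function(quo, tbl) {
  tryCatch(
    list(columns = names(eval_select(quo, tbl)), deferred_failure = FALSE),
    # Signalled by tidyselect when a requested column is not in the table
    vctrs_error_subscript_oob = function(e) {
      list(columns = character(0), deferred_failure = TRUE)
    }
    # No handler for other errors, so genuine evaluation errors propagate
  )
}

missing_col <- resolve_sketch(rlang::quo(all_of("z")), tbl)
genuine <- tryCatch(
  resolve_sketch(rlang::quo(stop("Oh no!")), tbl),
  error = function(e) "rethrown"
)
```

Note that this sketch discards the attempted column name on failure, whereas the PR keeps it so that the report can show it.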

The consequence of this PR is documented in test-tidyselect_fails_safely_batch.R, where I test the following columns expressions:

agent <- create_agent(tbl = small_table[, c("a", "b", "c")])
mixed_cols <- c("a", "z")
select_exprs <- rlang::quos(
  empty            = ,
  null             = NULL,
  exists           = a,
  nonexistent      = z,
  mixed            = c(a, z),
  mixed_all        = all_of(mixed_cols),
  mixed_any        = any_of(mixed_cols),
  empty_tidyselect = starts_with("z")
)

Current behaviors summarized below:

`col_*()` functions

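| | empty/NULL | exists | nonexistent | mixed | mixed_all | mixed_any | empty_tidyselect |
|---|---|---|---|---|---|---|---|
| n_steps | 1 | 1 | 1 | 2 | 2 | 1 | 1 |
| column | NA | a | z | c("a", "z") | c("a", "z") | a | NA |
| eval_error | TRUE | FALSE | TRUE | TRUE | TRUE | FALSE | TRUE |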

`rows_*()` functions

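| | empty/NULL | exists | nonexistent | mixed | mixed_all | mixed_any | empty_tidyselect |
|---|---|---|---|---|---|---|---|
| n_steps | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| column | a, b, c | a | z | a, z | a, z | a | NA |
| eval_error | FALSE | FALSE | TRUE | TRUE | TRUE | FALSE | TRUE |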

`col_exists()` function

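| | empty/NULL | exists | nonexistent | mixed | mixed_all | mixed_any | empty_tidyselect |
|---|---|---|---|---|---|---|---|
| n_steps | 1 | 1 | 1 | 2 | 2 | 1 | 1 |
| column | NA | a | z | c("a", "z") | c("a", "z") | a | NA |
| eval_error | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |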

The PR just tidies up expected behaviors and doesn't introduce any new features. As long as the above looks good, I think that should finally (let's hope!) wrap up the tidyselect integration in columns! (I'll need to check up on the yaml stuff again after this PR, but that should be a simple follow-up.)

rich-iannone · 2023-11-01T19:32:20Z

This is fantastic!

The use of any_of() should definitely be documented and, maybe some day in the future, the tables summarizing the various behaviors should go into an article of some sort (it would make it easier to understand how the underlying evaluation works, and people ought to be able to refer to that).

I'm not sure what could/should be done about the columns = NULL of col_exists(). It probably should 💥 in the report with a reason provided. Better than erroring, what do you think?

yjunechoe · 2023-11-01T20:43:40Z

> I'm not sure what could/should be done about the columns = NULL of col_exists(). It probably should 💥 in the report with a reason provided. Better than erroring, what do you think?

I agree! I actually went out of my way to make col_exists() error when columns isn't specified (to be consistent with what we had before). But yes, I think that changing that to 💥 is preferable and would be more like introducing a safety net than a breaking change.

I just added a columns = NULL default for col_exists(). Now both col_exists() and col_exists(columns = NULL) give us 💥 at report. This also gives us two neat unifying properties across all validation functions:

Failure to select a column always leads to 💥 at report, and never errors during validation planning.
f() and f(columns = NULL) share the same behavior (💥 for col_*() and everything() for rows_*())

I've edited the table in my prev comment to reflect this change.

One minor aesthetic thing I'd also like to tackle while I'm on this topic is the fact that if 0 columns are selected, the report doesn't tell the user how they got there (it just shows an empty cell for columns). I think in cases of e.g. starts_with("z"), it might be nice to show users the value in $column_expr instead of $column in the report. Again, purely a display thing so this could be a really clean piece of code we just stick inside get_agent_report() right before rendering the gt.

Mini proposal

A step like this that fails to select a column from an expression

create_agent(small_table) %>% 
  col_is_logical(starts_with("z")) %>% 
  interrogate()

Currently renders 💥 with nothing shown in the columns cell:

But could be neat and informative if it instead showed something like this:

rich-iannone · 2023-11-01T20:50:05Z

Thanks for making these changes. There is now a lot more consistency and the lack thereof (before this PR) definitely tripped up a few users!

Also I really like the proposal for the reporting change. Having the report be more informative is definitely a good thing!

yjunechoe · 2023-11-02T02:59:11Z

Done!

Now if the user attempts a dynamic column selection but none are found, the report will display the expression instead (colored red if that's part of the eval error).
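
The display rule can be pictured with a small base R sketch. The helper name and its arguments are hypothetical, chosen for illustration only; the real logic lives in the report rendering code and is not reproduced here.

```r
# Hypothetical helper: decide what to print in the report's columns cell.
# `column` is the resolved column name(s), or NA if nothing was selected;
# `columns_expr` is the user's selection expression as text.
display_columns_sketch <- function(column, columns_expr) {
  if (length(column) == 1 && is.na(column)) {
    columns_expr                    # nothing selected: show the expression
  } else {
    paste(column, collapse = ", ")  # otherwise show the resolved column(s)
  }
}

shown_ok   <- display_columns_sketch("a", "starts_with(\"a\")")
shown_none <- display_columns_sketch(NA_character_, "starts_with(\"z\")")
```

With these inputs, `shown_ok` is the resolved name `a`, and `shown_none` falls back to the text of the selection expression.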

Some examples at interrogate():

`col_is_integer()`

create_agent(small_table) %>% 
  col_is_integer() %>% 
  col_is_integer(starts_with("z")) %>% 
  col_is_integer(where(is.integer)) %>% 
  interrogate()

`rows_distinct()`

create_agent(small_table) %>% 
  rows_distinct() %>%
  rows_distinct(starts_with("z")) %>% 
  rows_distinct(where(is.integer)) %>%
  interrogate()

`col_exists()`

create_agent(small_table) %>% 
  col_exists() %>% 
  col_exists(starts_with("z")) %>% 
  col_exists(where(is.integer)) %>% 
  interrogate()

That's all I have for this PR! LMK if you have any suggestions on this feature or anything else.

rich-iannone · 2023-11-02T03:11:02Z

Whoa!! Yes, this is very good. Very, very good. The use of red for the explodey state is inspired!

rich-iannone

LFGTM!

rich-iannone · 2023-11-02T03:13:13Z

Super good work here! As ever, feel free to merge whenever!

yjunechoe · 2023-11-02T14:04:56Z

Oops didn't realize I dismissed your review by pushing doc/news updates 😅 - thanks for catching that

I'll make new PRs for completeness of tidyselect coverage (for has_columns(), info_columns()) and check up on columns-related yaml business.

yjunechoe added 7 commits October 31, 2023 14:41

column_validity_*() treat NA as no columns selected

e43c378

pass NA down to interrogate functions

432e89f

let NA through from interrogate_*() to column_validity_*() to be caught as no column selection failure

57cb2f0

expect interrogation failure for 0 column steps

7a27d4b

comment out tests for skipping behavior when tidyselecting 0 columns

d81da64

correctly signal no columns failure

ac6d349

safe to not use as_label() once column is symbol

a6c8706

yjunechoe marked this pull request as draft October 31, 2023 19:34

yjunechoe added 15 commits October 31, 2023 19:17

move preconditions evaluation outside of trycatch

0b24fc5

if mixed bag of columns, show successful ones first before the failing one

ab11d12

rows_distinct() passes NA down

a981c5c

remove old post-hoc tidyselect detection in interrogate and pass NA down again

7f9792d

test rows_distinct() shows empty not "NA" for 0 column tidyselect

97a7b6d

rows_complete() passes NA down

bac9c9d

apply same fix for interrogate_complete()

731ce31

same behavior for both rows_*() functions

992d717

uses_tidyselect() completely factored out

18a88be

col_exists() shouldn't default to everything()

e1ecc37

col_exists(NULL) signals failure

818bdf8

remove batch tests for old tidyselect skipping behavior

59ddbbc

rows_*() defaults to everything() for missing arg too, for completeness

242b140

edit test

6b02564

batch tests for various column selection behaviors

47a4407

yjunechoe changed the title ~~Null column selection fails gracefully at interrogation~~ Bad column selections fail gracefully at interrogation Nov 1, 2023

yjunechoe added 6 commits November 1, 2023 09:43

clean up test file

deb8ba3

rethrow genuine evaluation errors when resolving column selection

794b821

genuine evaluation errors trickle up for col_exists()

4b8ee31

table 0-column tidyselect special case printing in col_exists()

d676d49

add test for rethrowing genuine evaluation errors

f57c457

remove outdated test allowing any evaluation errors to pass through columns

0591147

yjunechoe added 3 commits November 1, 2023 10:50

use the more common symbol pattern for columns in test

d08e37e

lintr

6c1de73

ensure that missing column selection gets coded as NA and passed down

1187b2a

yjunechoe marked this pull request as ready for review November 1, 2023 15:23

explicit return from resolve_columns_possible()

502a23e

yjunechoe added 4 commits November 1, 2023 16:19

treat NULL and missing as same on the column_expr side

f6e2722

fix typo

61d2a0a

rollback special casing of missing columns in col_exists()

bb3239c

edit tests

6a7f4d1

yjunechoe added 2 commits November 1, 2023 22:20

display column expr in report if none selected

883962e

clean up column expr display logic and extend support to rows*()

cb7cca8

rich-iannone previously approved these changes Nov 2, 2023

yjunechoe added 2 commits November 2, 2023 09:46

document() NULL default in col_exists

da8736c

news item for displaying column expr instead of blank

9aa03dd

yjunechoe dismissed rich-iannone’s stale review via 9aa03dd November 2, 2023 13:53

rich-iannone approved these changes Nov 2, 2023

yjunechoe merged commit 5905874 into rstudio:main Nov 2, 2023
13 checks passed

yjunechoe deleted the null-column-fails branch November 2, 2023 14:05

yjunechoe mentioned this pull request Nov 2, 2023

Better handling for bad column selections #497

Closed

yjunechoe mentioned this pull request Nov 12, 2023

Bugfix for col_vals_expr() at report #507

Merged

yjunechoe mentioned this pull request Feb 27, 2024

yaml_write() fails when active argument is a function that returns FALSE #355

Open

10 tasks
