Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bad column selections fail gracefully at interrogation #499

Merged
merged 40 commits into from Nov 2, 2023

Conversation

yjunechoe
Copy link
Collaborator

@yjunechoe yjunechoe commented Oct 31, 2023

Following up on #497, this PR ensures that:

  • NULL or missing value for columns cause the step to 💥, unless default is present (ex: everything() in rows*() functions)
  • Accidentally tidyselecting 0-columns small_table %>% ... starts_with("z") also cause the step to 💥
  • Intentionally tidyselecting specific, non-existent columns small_table %>% ... all_of("z") cause the step to 💥 and will also show attempted column z in report

Luckily, this required minimal changes to the code. The lifecycle of 0-column selection 💥 is now the following:

  1. resolve_columns() returns NA for 0-column selections, both in the case of columns=NULL/<empty> and tidyselecting 0-columns. From this point onwards, the two are treated as the same (we can always later recover how 0-column was selected by looking at $validation_set$columns_expr).
  2. create_validation_step() writes NA for the $column column of the validation step
  3. At interrogate(), the NA is read in as-is at each step, and passed down to individual interrogate_*() functions. Previously, it would use this info to skip certain steps (I mistook this for a feature that should be applied to all failure-to-select cases, hence introducing the skipping bug!)
  4. The get_column_as_sym_at_idx() internal called by the interrogate_*() functions allow NA to pass through as-is. Previously, it'd turn NA into "NA"
  5. The NA is caught inside column_validity_has_columns() called at the top of each interrogate_*() function, and the error thrown from there is signaled to the report with a "... yielded no columns" evaluation failure.

Current behavior (will write them into tests later):

devtools::load_all()
agent <- create_agent(~ small_table)

NULL/missing booms

agent %>% 
  col_vals_gt(columns = NULL, value = 5) %>% 
  interrogate()
agent %>% 
  col_vals_gt(value = 5) %>% 
  interrogate()

image

Selecting non-existent column booms

And shows the column attempted to select

agent %>% 
  col_vals_gt(z, 5) %>% 
  interrogate()
agent %>% 
  col_vals_gt(all_of("z"), 5) %>% 
  interrogate()

image

And when tidyselect returns no matches, columns is empty, like in the case of NULL:

agent %>% 
  col_vals_gt(starts_with("z"), 5) %>% 
  interrogate()

image

Selecting existing column succeeds

agent %>% 
  col_vals_gt(a, 5) %>% 
  interrogate()
agent %>% 
  col_vals_gt(all_of("a"), 5) %>% 
  interrogate()
agent %>% 
  col_vals_gt(starts_with("a"), 5) %>% 
  interrogate()

image

Selecting a mix booms selectively

agent %>% 
    col_vals_gt(c(a, z), 5) %>% 
    interrogate()
agent %>% 
    col_vals_gt(c("a", "z"), 5) %>% 
    interrogate()
multicol <- c("a", "z")
agent %>% 
  col_vals_gt(all_of(multicol), 5) %>% 
  interrogate()

image

(NEW) any_of() safely selects only the existing columns

I totally forgot about any_of() and I feel like it does some of the job we want has_columns() to do. Curious what your thoughts are on advertising this as one safe way to select columns in validation functions!

agent %>% 
  col_vals_gt(any_of(multicol), 5) %>% 
  interrogate()
agent %>% 
    col_vals_gt(any_of(c("a", "z")), 5) %>% 
    interrogate()

image

@yjunechoe yjunechoe marked this pull request as draft October 31, 2023 19:34
@yjunechoe yjunechoe changed the title Null column selection fails gracefully at interrogation Bad column selections fail gracefully at interrogation Nov 1, 2023
@yjunechoe
Copy link
Collaborator Author

yjunechoe commented Nov 1, 2023

I'm feeling pretty good about the coverage of bad column selection behaviors so far! I borrowed an existing (now outdated) batch test setup to run various column selection scenarios for all validation functions, organized into three groups:

  1. The col_*() group, which expands multiple columns into multiple steps, 1 column for each step.
  2. The row_*() group, which always returns a single step and also defaults to everything() for missing/NULL
  3. The special col_exists() function, which never 💥 unless columns is NULL/empty.

All column selection failures now show up as 💥 in the report, and skipping behavior has been completely decoupled from columns. The only time columns throws an immediate error is if it's a genuine evaluation error. For example, this errors early at col_vals_lt() and never hits interrogate():

agent %>%
  col_vals_lt(columns = stop("Oh no!"), value = 5) %>%
  interrogate()
#> Error in `col_vals_lt()`:
#> ! Problem while evaluating `stop("Oh no!")`.
#> Caused by error:
#> ! Oh no!

The consequence of this PR is documented in test-tidyselect_fails_safely_batch.R, where I test the following columns expressions:

agent <- create_agent(tbl = small_table[, c("a", "b", "c")])
mixed_cols <- c("a", "z")
select_exprs <- rlang::quos(
  empty            = ,
  null             = NULL,
  exists           = a,
  nonexistent      = z,
  mixed            = c(a, z),
  mixed_all        = all_of(mixed_cols),
  mixed_any        = any_of(mixed_cols),
  empty_tidyselect = starts_with("z")
)

Current behaviors summarized below:


col_*() functions

empty/NULL exists nonexistent mixed mixed_all mixed_any empty_tidyselect
n_steps 1 1 1 2 2 1 1
column NA a z c("a", "z") c("a", "z") a NA
eval_error TRUE FALSE TRUE TRUE TRUE FALSE TRUE

row_*() functions

empty/NULL exists nonexistent mixed mixed_all mixed_any empty_tidyselect
n_steps 1 1 1 1 1 1 1
column a, b, c a z a, z a, z a NA
eval_error FALSE FALSE TRUE TRUE TRUE FALSE TRUE

col_exists() function

empty/NULL exists nonexistent mixed mixed_all mixed_any empty_tidyselect
n_steps 1 1 1 2 2 1 1
column NA a z c("a", "z") c("a", "z") a NA
eval_error TRUE FALSE FALSE FALSE FALSE FALSE FALSE

The PR just tidies up expected behaviors and doesn't introduce any new features. As long as the above looks good, I think that should finally (let's hope!) wrap up the tidyselect integration in columns! (I'll need to check up on the yaml stuff again after this PR, but that should be a simple follow up)

@yjunechoe yjunechoe marked this pull request as ready for review November 1, 2023 15:23
@rich-iannone
Copy link
Member

This is fantastic!

The use of any_of() should definitely be documented and, maybe some day in the future, the tables summarizing the various behaviors should go into an article of some sort (it would make the understanding of how the underlying evaluation works, and people ought to be able to refer to that).

I'm not sure what could/should be done about the columns = NULL of col_exists(). It probably should 💥 in the report with a reason provided. Better than erroring, what do you think?

@yjunechoe
Copy link
Collaborator Author

I'm not sure what could/should be done about the columns = NULL of col_exists(). It probably should 💥 in the report with a reason provided. Better than erroring, what do you think?

I agree! I actually went out of my way to make col_exists() error when columns isn't specified (to be consistent with what we had before). But yes, I think that changing that to 💥 is preferrable and would be more like introducing a safety net than a breaking change.

I just added a columns = NULL default for col_exists(). Now both col_exists() and col_exists(columns = NULL) give us 💥 at report. This also gives us two neat unifying properties across all validation functions:

  • Failure to select a column always leads to 💥 at report, and never errors during validation planning.
  • f() and f(columns = NULL) share the same behavior (💥 for col*() and everything() for row*())

I've edited the table in my prev comment to reflect this change.


One minor aesthetic thing I'd also like to tackle while I'm on this topic is the fact that if 0 columns are selected, the report doesn't tell the user how they got there (it just shows an empty cell for columns). I think in cases of e.g. starts_with("z"), it might be nice to show users the value in $column_expr instead of $column in the report. Again, purely a display thing so this could be a really clean piece of code we just stick inside get_agent_report() right before rendering the gt.

Mini proposal

A step like this that fails to select a column from an expression

create_agent(small_table) %>% 
  col_is_logical(starts_with("z")) %>% 
  interrogate()

Currently renders 💥 with nothing to show for in columns:

image

But could be neat and informative if it instead showed something like this:

image

@rich-iannone
Copy link
Member

Thanks for making these changes. There is now a lot more consistency and the lack thereof (before this PR) definitely tripped up a few users!

Also I really like the proposal for the reporting change. Having the report be more informative is definitely a good thing!

@yjunechoe
Copy link
Collaborator Author

yjunechoe commented Nov 2, 2023

Done!

Now if the user attempts a dynamic column selection but none are found, the report will display the expression instead (colored red if that's part of the eval error).

Some examples at interrogate():

col_is_integer()

create_agent(small_table) %>% 
  col_is_integer() %>% 
  col_is_integer(starts_with("z")) %>% 
  col_is_integer(where(is.integer)) %>% 
  interrogate()

image

rows_distinct()

create_agent(small_table) %>% 
  rows_distinct() %>%
  rows_distinct(starts_with("z")) %>% 
  rows_distinct(where(is.integer)) %>%
  interrogate()

image

col_exists()

create_agent(small_table) %>% 
  col_exists() %>% 
  col_exists(starts_with("z")) %>% 
  col_exists(where(is.integer)) %>% 
  interrogate()

image


That's all I have for this PR! LMK if you have any suggestions on this feature or anything else.

@rich-iannone
Copy link
Member

Whoa!! Yes, this is very good. Very, very good. The use of red for the explodey state is inspired!

rich-iannone
rich-iannone previously approved these changes Nov 2, 2023
Copy link
Member

@rich-iannone rich-iannone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LFGTM!

@rich-iannone
Copy link
Member

Super good work here! As ever, feel free to merge whenever!

@yjunechoe
Copy link
Collaborator Author

Oops didn't realize I dismissed your review by pushing doc/news updates 😅 - thanks for catching that

I'll make new PRs for completeness of tidyselect coverage (for has_columns(), info_columns()) and check up on columns-related yaml business.

@yjunechoe yjunechoe merged commit 5905874 into rstudio:main Nov 2, 2023
13 checks passed
@yjunechoe yjunechoe deleted the null-column-fails branch November 2, 2023 14:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants