Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite na_if() using vctrs #6329

Merged
merged 5 commits into from
Jul 19, 2022

Conversation

DavisVaughan
Copy link
Member

@DavisVaughan DavisVaughan commented Jul 12, 2022

We talked about making na_if() work in a match-like way, like na_if(x, c("bad", "NA")), but I no longer think we should do that. While that might be more useful, I think:

  • It would break existing code where people are currently supplying vector y values and expect == to be used, which does seem to come up on a quick GitHub scan

  • It would no longer match NULLIF() from SQL, which it is advertised as mimicking. The Microsoft SQL docs state that NULLIF(e1, e2) should be treated as a "searched CASE expression", meaning that the following should be exactly equivalent

    NULLIF(e1, e2)
    
    CASE  
      WHEN e1 = e2 THEN NULL  
      ELSE e1  
    END 
    

    https://docs.microsoft.com/en-us/sql/t-sql/language-elements/nullif-transact-sql?view=sql-server-ver16#remarks

If people want a match-like na_if(), we could add replace_match(x, c("bad", "NA") ~ NA) as part of #6328


Other notes on what changes from this update:

  • Now casts y to the type of x, with the intention of making it very clear that this is type stable on x
  • Now uses vec_equal() for comparison
  • Now uses vec_assign() for assignment

"9c7db457-e0e3-4d62-8d92-1a8af03dae11"

@DavisVaughan
Copy link
Member Author

DavisVaughan commented Jul 13, 2022

These changes broke 10 revdeps (see #6262), but all of them are off-label usage of na_if():

  • 6 are related to doing something like na_if(<tibble>, 0) to replace all instances of 0 across any column of the tibble with NA. This accidentally worked because <tibble> == 0 returns a logical matrix that [<- knows how to deal with. But I consider this off-label usage
  • 4 are expecting a double x and character y (or vice versa) to be compatible, sometimes in very weird ways like: dplyr::na_if(., "NaN") or dplyr::na_if(., "-Inf"), where they seem to be trying to convert . to character before making the comparison

I think we should press forward with this and just make PRs for them since all of them are bugs.

I updated the NEWS bullet to call out the na_if(<tibble>, 0) behavior since that came up so often.

R/na_if.R Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants