Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Code using startsWith recommended by string_boundary_linter is not functionally identical when data has NAs #1376

Closed
bersbersbers opened this issue Jun 8, 2022 · 10 comments · Fixed by #1379

Comments

@bersbersbers
Copy link

bersbersbers commented Jun 8, 2022

strs <- c("AB", "CD", NA)

# Use startsWith() to detect a fixed initial substring. Doing so is more readable and more efficient.
grepl("^A", strs)  # [1]  TRUE FALSE FALSE

startsWith(strs, "A")  # [1]  TRUE FALSE    NA

I would expect that the warning gives enough details to either

  • suggest a working solution that considers NAs the same as grepl (I am not aware of any short in-line way), or
  • a warning that NA will require special treatment.

I fear that this necessary special treatment will be bad for readability, anyway.

@MichaelChirico
Copy link
Collaborator

surprised I never caught this 😳

I guess that explains #1331 too.

personally I think !is.na(x) & startsWith(x, pattern) better fits a pattern of handling NA explicitly in code (much like the na.rm=FALSE default forces you to be explicit).

we have a similar recommendation for nchar(x)>0 ➡️ nchar(x, keepNA=TRUE) and x %in% 1 ➡️ !is.na(x) & x==1 (both of these linters will be upstreamed in a future release).

so I lean towards just improving the lint message and linter documentation.

@bersbersbers
Copy link
Author

bersbersbers commented Jun 8, 2022

I am amazed that this even works:

!is.na(x) & startsWith(x, pattern)

I am aware of short-circuit evaluation, of course, but I was not aware it's applied element-wise in R. Indeed, c(FALSE, TRUE) & c(NA, NA) gives [1] FALSE NA.

That said, while I was surprised that the result of !is.na(x) & startsWith(x, pattern) is what it is, I agree that the intent of !is.na(x) & startsWith(x, pattern) is very clear.

@MichaelChirico
Copy link
Collaborator

this is good extended boolean logic at work -- NA means "either TRUE or FALSE, but we're not sure which", i.e., there's a value, it's boolean, but identity unknown.

but regardless of identity of y, TRUE & y is always TRUE, hence the result

@bersbersbers
Copy link
Author

but regardless of identity of y, TRUE & y is always TRUE, hence the result

FALSE & y == FALSE, you mean, but agreed otherwise. TRUE & NA == NA.

@MichaelChirico
Copy link
Collaborator

😅 right. TRUE | y is TRUE. mixed them up

@AshesITR
Copy link
Collaborator

AshesITR commented Jun 8, 2022

WDYT about suggesting !is.na(x) & startsWith(x, pattern) as the replacement? It will be quite clear that NA_character_ needs special thought and, it it is unnecessary, just replacing the code by startsWith(x, pattern) is self-evident.

@bersbersbers
Copy link
Author

bersbersbers commented Jun 8, 2022

WDYT about suggesting !is.na(x) & startsWith(x, pattern) as the replacement? It will be quite clear that NA_character_ needs special thought and, it it is unnecessary, just replacing the code by startsWith(x, pattern) is self-evident.

Agree with all of the above. I just wonder if with this suggestion, the readability argument still holds. I find grepl more readable if x is a long expression that needs to be repeated.

Also, what do you want users to do if x is a function call, potentially computationally expensive; and maybe even with side effects? In that case, the proposed code is definitely not more efficient, and maybe even incorrect.

What about tidyr::replace_na(startsWith(x, pattern), FALSE)? I feel one might keep the current suggestion and add something like

If NA values shall return FALSE as in grepl, consider wrapping the startsWith call inside tidyr::replace_na(..., FALSE).

@AshesITR
Copy link
Collaborator

AshesITR commented Jun 8, 2022

Technically, we could also suggest startsWith(x, pattern) %in% TRUE, although that is probably quite arcane usage of %in% - while probably the most efficient we can do in base R.

NB that we also lint str_detect(x, "^A") which does keep NAs...

@MichaelChirico
Copy link
Collaborator

MichaelChirico commented Jun 8, 2022

NB that we also lint str_detect(x, "^A") which does keep NAs...

right, str_detect() is not an issue because it has consistent behavior. Ditto for the substr()/substring() lints.

So I think an allow_grepl option is in order. I would also be in favor of a message for grepl() like

Use !is.na(x) & startsWith(x, string) instead, or, if missingness is not a concern, just startsWith(x, string)

Hopefully this conveys both situations: (1) i have missing values and I expected/wanted grepl() to return NA, which startsWith() accomplishes and (2) i have no missing values, so I prefer the more concise version of just using startsWith()

Technically, we could also suggest startsWith(x, pattern) %in% TRUE

please no 😂

much more practical is assigning the expensive result first:

x <- expensive_computation(...)
!is.na(x) & startsWith(x, pattern)

PS FWIW, I suspect a lot of people are using grepl in dplyr::filter(), base::subset(), or data.table[ filtering, all of which treat NA as FALSE anyway.

@bersbersbers
Copy link
Author

So I think an allow_grepl option is in order.

Sounds good - I think I would turn this on immediately in my code, and I would not be sad if it were on by default :) But either default value is fine.

I would also be in favor of a message for grepl() like

Use !is.na(x) & startsWith(x, string) instead, or, if missingness is not a concern, just startsWith(x, string)

Hopefully this conveys both situations: (1) i have missing values and I expected/wanted grepl() to return NA, which startsWith() accomplishes and (2) i have no missing values, so I prefer the more concise version of just using startsWith()

This message is perfect IMO.

PS FWIW, I suspect a lot of people are using grepl in dplyr::filter(), base::subset(), or data.table[ filtering, all of which treat NA as FALSE anyway.

For the record, I noticed its use in logical indexing of vectors and data frames, where it does make a difference:

strs <- c("AB", "CD", NA)

strs[grepl("^A", strs)] # "AB"
strs[startsWith(strs, "A")] # "AB" NA

as.data.frame(strs)[grepl("^A", strs), ] # "AB"
as.data.frame(strs)[startsWith(strs, "A"), ] # "AB" NA

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants