extract() does not honor regex options #693

mixolydianpink · 2019-07-26T10:04:18Z

extract() does not honor regex options. Here's a reprex:

# Example data.
df <- data.frame(x="A")
# Extract the first case-insensitive (English) letter.
out <- tidyr::extract(df,
  x,
  c("A"),
  stringr::regex("([a-z])", ignore_case = TRUE)
)

# Expect this to be "A"
out$A
#> [1] NA

I saw #223, but I think this is a different issue.

I'm really not familiar with R as a language or the tidyverse as a system so I'm not sure if this is intentional behavior, but I wasn't able to find documentation that extract() treats the regex as a simple character vector. The documentation for extract() refers to it only as "a regular expression."

The apparent issue is this line in extract.R:

  matches <- stringi::stri_match_first_regex(x, regex)[, -1, drop = FALSE]

I would expect it to pass the options of the regex to stri_match_first_regex using its opts_regex parameter.

As far as I can tell, tidyr doesn't import stringr at all. So extract()s lack of awareness of stringr's regex types could be an intentional design choice of the tidyverse. But that should still be better documented.

I attempted to write a fix with the assumption that use of stringr was allowed, but ran into the problem that stringr's method of getting options from a regex seems to be opts(), which is not exported. In any case, my changes were

# extract.R:57
# Doesn't work. `stringr::opts()` not available.
  matches <- stringi::stri_match_first_regex(x,
    regex,
    opts_regex = stringr::opts(regex)
  )[, -1, drop = FALSE]

# test-extract.R:48
test_that("honors regex options", {
  df <- data.frame(x="A")
  out <- df %>% extract(x, c("A"), stringr::regex("([a-z])", ignore_case = TRUE))
  expect_equal(out$A, "A")
})

The text was updated successfully, but these errors were encountered:

hadley · 2019-09-07T12:48:03Z

I think this is just a documentation issue; due to stringr's use of stringi (which has heavy compilation requirements) we unfortunately can't use stringr everywhere within the tidyverse.

Fixes tidyverse#693 Note that we're using a stringi regular expression. Also note it has to be a string, not something like a stringr::regex

hadley · 2020-04-22T18:56:14Z

My explanation was a bit incomplete since tidyr actually does use stringi — I don't think we want to commit to any specific regular expression engine in the docs, since we might want to change them out in the future.

hadley · 2021-12-17T13:44:07Z

@DavisVaughan I think we can now commit to separate() continuing to use base regexps, so this probably just needs a small documentation update. If you wanted to be extra, in separate() and extract() we could also check that the pattern doesn't have class pattern (for now) or stringr_pattern (for after tidyverse/stringr#384)

hadley added documentation help wanted ❤️ we'd love your help! labels Sep 7, 2019

yutannihilation mentioned this issue Sep 18, 2019

extract doesn't work properly when text contains \n #749

Closed

namelessjon added a commit to namelessjon/tidyr that referenced this issue Oct 19, 2019

Improve documentation for extract()

96e082c

Fixes tidyverse#693 Note that we're using a stringi regular expression. Also note it has to be a string, not something like a stringr::regex

namelessjon mentioned this issue Oct 19, 2019

Improve documentation for extract() #785

Closed

This comment has been minimized.

Sign in to view

hadley added the strings 🎻 label Dec 17, 2021

DavisVaughan mentioned this issue Dec 20, 2021

Update extract() documentation and disallow stringr patterns #1283

Merged

DavisVaughan closed this as completed in #1283 Dec 20, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

extract() does not honor regex options #693

extract() does not honor regex options #693

mixolydianpink commented Jul 26, 2019

hadley commented Sep 7, 2019

hadley commented Apr 22, 2020

This comment has been minimized.

hadley commented Dec 17, 2021

extract() does not honor regex options #693

extract() does not honor regex options #693

Comments

mixolydianpink commented Jul 26, 2019

hadley commented Sep 7, 2019

hadley commented Apr 22, 2020

This comment has been minimized.

hadley commented Dec 17, 2021