Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support more advanced patterns in matches() #3209

Closed
mungojam opened this issue Nov 16, 2017 · 6 comments
Closed

Support more advanced patterns in matches() #3209

mungojam opened this issue Nov 16, 2017 · 6 comments

Comments

@mungojam
Copy link

@mungojam mungojam commented Nov 16, 2017

I tried to use matches() to match a slightly complex regular expression but it didn't support it so I had to use stringr instead. It seems that matches() uses grep rather than stringr.

test <- tibble(
                       hh15_test = c(1,2,3), 
                       hh02_test = c(3,3,3), 
                       hh3_test=c(4,4,5), 
                       blah = c(1,2,3)
)

#Doesn't work
select(test, matches("(hh1)|(^(?!hh))"))

#Error in grep(needle, haystack, ...) : 
#  invalid regular expression '(hh1)|(^(?!hh))', reason 'Invalid regexp'

#Works
select(test, str_subset(current_vars(), "(hh1)|(^(?!hh))"))

# A tibble: 3 x 2
#  hh15_test  blah
      <dbl> <dbl>
#1         1     1
#2         2     2
#3         3     3
@batpigandme
Copy link
Member

@batpigandme batpigandme commented Nov 16, 2017

Until this most recent release of the tidyverse package, stringr hadn't been part of the "core" (i.e. wasn't attached unless loaded separately from tidyverse). There's also the dependency factor.

@cderv
Copy link
Contributor

@cderv cderv commented Nov 17, 2017

Hi,

looking at the source code, matches uses grep function to apply the regex and grep uses Extended Regular Expression by default. (see help("regex", package = "base")). It can use PERL regex with perl=T but not activated in dplyr's matches function.
stringr uses another engine called ICU (close to perl syntax). Based on that you'll understand there are severals flavors of regex and all does not have the same functionalities. Here, it seems base R regex doesn't know about negative lookahead - so it does not recognize the regex.

Here is demo codes for illustration of all this
EDIT: Modified reprex to be consistent with @mungojam previous editing

reprex::reprex_info()
#> Created by the reprex package v0.1.1.9000 on 2017-11-17

library(dplyr, warn.conflicts = F)
test <- tibble(
  hh15_test = c(1,2,3), 
  hh02_test = c(3,3,3), 
  hh3_test=c(4,4,5), 
  blah = c(1,2,3)
)
test
#> # A tibble: 3 x 4
#>   hh15_test hh02_test hh3_test  blah
#>       <dbl>     <dbl>    <dbl> <dbl>
#> 1         1         3        4     1
#> 2         2         3        4     2
#> 3         3         3        5     3
# does not recognized
select(test, matches("(hh1)|^(?!hh)"))
#> Error in grep(needle, haystack, ...): expression régulière '(hh1)|^(?!hh)' incorrecte, à cause de 'Invalid regexp'
# does not recognized either
grep("(hh1)|^(?!hh)", names(test))
#> Error in grep("(hh1)|^(?!hh)", names(test)): expression régulière '(hh1)|^(?!hh)' incorrecte, à cause de 'Invalid regexp'
# It is recognized now
grep("(hh1)|^(?!hh)", names(test), perl = T)
#> [1] 1 4
# stringr recognizes it too
stringr::str_subset(names(test), "(hh1)|^(?!hh)")
#> [1] "hh15_test" "blah"

At last, you said that this works
select(test, str_subset(current_vars(), "(hh1)|(?!hh)"))
but it selects all the columns. Is this the intended behaviour ?

In conclusion, your issue raises a question about which flavors dplyr should work with.
Also, maybe regex should work the same around tidyverse now that stringr is part of it and it seems that dplyr differs in term of regex. But as you can use stringr inside dplyr, maybe matches not needed. Let's see what the tidyverse team has to say on all this. I just wanted to add some informations on this issue.

@mungojam
Copy link
Author

@mungojam mungojam commented Nov 17, 2017

Thanks for the digging @cderv. You've also spotted a bug in my simplified reproduction. The second case was actually throwing an error too. It's working now in the second case as expected.

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Dec 12, 2017

@hadley: What flavor of regular expressions should we accept in matches()?

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Apr 11, 2018

Moving this to r-lib/tidyselect#58 as matches was factored out of dplyr.

@lock
Copy link

@lock lock bot commented Oct 8, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Oct 8, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants