An R package for named capture regular expressions
The namedCapture package provides user-friendly functions for
extracting data tables from non-tabular text, using named capture
regular expressions, which look like (?P<fruit>orange|apple). That
pattern has one group named fruit that matches either orange or
apple.
food.vec <- c(one="apple", nope="courgette", two="orange")
namedCapture::str_match_named(food.vec, "(?P<fruit>orange|apple)")
#> fruit
#> one "apple"
#> nope NA
#> two "orange"
namedCapture::str_match_variable(food.vec, fruit="orange|apple")
#> fruit
#> one "apple"
#> nope NA
#> two "orange"

Both results are character matrices with one row for each element of the subject (NA where there is no match) and one named column for each capture group. The second version is the preferred syntax, which generates a named capture group for each named R argument.

See also: the newer nc package, which provides similar functionality, but also supports the ICU regex engine (in addition to the PCRE and RE2 engines).
Installation
install.packages("namedCapture")
##OR:
if(!require(devtools))install.packages("devtools")
devtools::install_github("tdhock/namedCapture")

Usage overview
There are five main functions provided in namedCapture:
| | Extract first match | Extract each match |
|---|---|---|
| chr subject + two arguments | str_match_named | str_match_all_named |
| chr subject + variable args | str_match_variable | str_match_all_variable |
| df subject + variable args | df_match_variable | Not implemented |
The function prefix indicates the type of the first argument, which must contain the subject:
- str_* means a character vector – each of these functions uses a single named capture regular expression to extract data from a character vector subject.
- df_* means a data.frame – the df_match_variable function uses a different named capture regular expression to extract data from each of several specified character column subjects.
The function suffix indicates the type of the other arguments (after the first):
- *_named means three arguments: subject, pattern, functions. The pattern should be a length-1 character vector that contains named capture groups, e.g. “(?P<groupName1>subPattern1)”. Read the Old three argument syntax vignette for more info.
- *_variable means a variable number of arguments in which the pattern is specified using character strings, type conversion functions, and lists. Read the Recommended variable argument syntax vignette for more info about this powerful and user-friendly syntax, which is the suggested way of using namedCapture.
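As a minimal sketch of the variable argument syntax, the example below mixes named patterns, an unnamed literal, and a type conversion function. The subject strings and the group names chrom and chromStart are made up for illustration; the call assumes that a conversion function placed right after a named pattern is applied to that group, and that the result is a data.frame when any conversion function is given.

```r
## Hypothetical subject: genomic positions, plus one string that will not match.
pos.vec <- c(first="chr10:213054000", second="no position here")

match.df <- namedCapture::str_match_variable(
  pos.vec,
  chrom="chr[0-9XYM]+",              # named argument: becomes group "chrom"
  ":",                               # unnamed string: matched, not captured
  chromStart="[0-9]+", as.integer)   # function converts the preceding group

str(match.df)
```

The non-matching subject should yield a row of missing values, as in the character matrix examples at the top of this page.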
Additional vignettes:
- Comparing verbose regex syntax shows comparisons with PCRE_EXTENDED mode and the rex R package.
- Comparing regex functions for data.frames shows comparisons with the tidyr R package.
Choice of regex engine
By default, namedCapture uses RE2 if the re2r package is available, and PCRE otherwise.
- RE2 uses a polynomial time matching algorithm, so it can be faster than PCRE, whose worst case is exponential time.
- RE2 does not support backreferences, but PCRE does.
- RE2 only supports (?P<groupName>groupPattern) syntax for named groups, whereas PCRE also supports (?<groupName>groupPattern) syntax (without the initial P).
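The PCRE side of these differences can be checked without namedCapture, because base R uses PCRE when perl=TRUE. The sketch below is only an illustration with base R functions, not how namedCapture works internally; the helper first_capture is a hypothetical name introduced here.

```r
food.vec <- c(one="apple", nope="courgette", two="orange")

## Hypothetical helper: text of the first capture group, NA if no match.
first_capture <- function(pattern, subject){
  m <- regexpr(pattern, subject, perl=TRUE)  # perl=TRUE selects PCRE
  start <- as.vector(attr(m, "capture.start"))
  len <- as.vector(attr(m, "capture.length"))
  ifelse(start == -1, NA_character_, substring(subject, start, start + len - 1))
}

## PCRE accepts both named group syntaxes.
with.P <- first_capture("(?P<fruit>orange|apple)", food.vec)
without.P <- first_capture("(?<fruit>orange|apple)", food.vec)
identical(with.P, without.P)

## PCRE supports backreferences: \1 must repeat what group 1 matched.
has.repeat <- grepl("(a|b)\\1", c("aa", "ab"), perl=TRUE)
```

With RE2, only the first of the two named group patterns would be accepted, and the backreference pattern would be rejected.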
To tell namedCapture that you would like to use PCRE even if RE2 is available, use
options(namedCapture.engine="PCRE")

Named capture regular expressions tutorial
For a more complete introduction to named capture regular expressions in R and Python, see https://github.com/tdhock/regex-tutorial
Related work
See my journal paper about namedCapture for a detailed discussion of R regex packages.
