An R package for named capture regular expressions

The namedCapture package provides user-friendly functions for extracting data tables from non-tabular text, using named capture regular expressions, which look like (?P<fruit>orange|apple). That pattern has one group named fruit that matches either orange or apple.

food.vec <- c(one="apple", nope="courgette", two="orange")
namedCapture::str_match_named(food.vec, "(?P<fruit>orange|apple)")
#>      fruit   
#> one  "apple" 
#> nope NA      
#> two  "orange"
namedCapture::str_match_variable(food.vec, fruit="orange|apple")
#>      fruit   
#> one  "apple" 
#> nope NA      
#> two  "orange"

Both results are character matrices with one row for each subject (NA where the pattern does not match) and one named column for each capture group. The second version is the preferred syntax, which generates a named capture group for each named R argument. See also the newer nc package, which provides similar functionality and also supports the ICU regex engine (in addition to the PCRE and RE2 engines).

Installation

install.packages("namedCapture")
##OR:
if(!require(devtools))install.packages("devtools")
devtools::install_github("tdhock/namedCapture")

Usage overview

There are five main functions provided in namedCapture:

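|                             | Extract first match | Extract each match     |
|-----------------------------+---------------------+------------------------|
| chr subject + two arguments | str_match_named     | str_match_all_named    |
| chr subject + variable args | str_match_variable  | str_match_all_variable |
| df subject + variable args  | df_match_variable   | Not implemented        |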
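
As a rough sketch of the difference between the two columns, the code below runs a first-match function and its all-match counterpart on the same subject. The subject vector and its names are made up for illustration, and the comments describe the return shapes as understood from the package documentation.

```r
library(namedCapture)

## Hypothetical subject: "a" contains two fruits, "b" contains none.
fruit.subject <- c(a="apple and orange", b="kiwi")
fruit.pattern <- "(?P<fruit>orange|apple)"

## First match: one row per subject, NA where nothing matches.
first.mat <- str_match_named(fruit.subject, fruit.pattern)

## Each match: a list with one element per subject, and each
## element has one row per match found in that subject.
all.list <- str_match_all_named(fruit.subject, fruit.pattern)

nrow(first.mat)       # number of subjects
nrow(all.list[["a"]]) # number of matches in subject "a"
```

The first-match functions suit subjects that hold one record each, whereas the all-match functions suit a long string that holds many records.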

The function prefix indicates the type of the first argument, which must contain the subject:

  • str_* means a character vector – each of these functions uses a single named capture regular expression to extract data from a character vector subject.
  • df_* means a data.frame – the df_match_variable function uses a different named capture regular expression to extract data from each of several specified character column subjects.
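
To make the df_* case concrete, here is a minimal sketch that applies a different pattern to each of two character columns. The data.frame, its column names, and the group names are invented for this example. The result is expected to keep the original columns and add one column per capture group, named column.group.

```r
library(namedCapture)

## Hypothetical data: each column has its own text format.
subject.df <- data.frame(
  position=c("chr10:213054000", "chrM:111000"),
  elapsed=c("07:04", "12:55"),
  stringsAsFactors=FALSE)

match.df <- df_match_variable(
  subject.df,
  ## Argument names select the column to parse; each list
  ## holds the pattern for that column.
  position=list(
    chrom="chr.*?",
    ":",
    start="[0-9]+", as.integer),
  elapsed=list(
    hours="[0-9]+", as.integer,
    ":",
    minutes="[0-9]+", as.integer))

str(match.df)
```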

The function suffix indicates the type of the other arguments (after the first):

  • *_named means three arguments: subject, pattern, functions. The pattern should be a length-1 character vector that contains named capture groups, e.g. "(?P<groupName1>subPattern1)". Read the Old three argument syntax vignette for more info.
  • *_variable means a variable number of arguments in which the pattern is specified using character strings, type conversion functions, and lists. Read the Recommended variable argument syntax vignette for more info about this powerful and user-friendly syntax, which is the suggested way of using namedCapture.
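
The two suffixes can be compared side by side. The sketch below extracts the same two groups with each syntax, using a made-up subject vector. In the *_named call the third argument is a named list of conversion functions. In the *_variable call each conversion function directly follows the named pattern it applies to.

```r
library(namedCapture)

## Hypothetical subject with genomic positions.
pos.subject <- c(first="chr10:213054000", second="chrM:111000")

## Three argument syntax: subject, pattern, functions.
named.df <- str_match_named(
  pos.subject,
  "(?P<chrom>chr.*?):(?P<start>[0-9]+)",
  list(start=as.integer))

## Variable argument syntax: named arguments become groups,
## unnamed strings are plain pattern text, and a function
## converts the group defined just before it.
variable.df <- str_match_variable(
  pos.subject,
  chrom="chr.*?",
  ":",
  start="[0-9]+", as.integer)

str(named.df)
str(variable.df)
```

With the variable argument syntax the group name, its pattern, and its conversion function sit on one line, which tends to be easier to read and edit than a long pattern string plus a separate list.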

Additional vignettes:

Choice of regex engine

By default, namedCapture uses RE2 if the re2r package is available, and PCRE otherwise.

  • RE2 uses a polynomial time matching algorithm, so it can be faster than PCRE, which takes exponential time in the worst case.
  • RE2 does not support backreferences, but PCRE does.
  • RE2 only supports (?P<groupName>groupPattern) syntax for named groups, whereas PCRE also supports (?<groupName>groupPattern) syntax (without the initial P).
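
The PCRE-only features in this list can be tried without namedCapture, because base R uses PCRE when perl=TRUE. The sketch below is base R only, so it illustrates the engine rather than this package. The subject vector is made up for the example.

```r
subject <- c("apple", "courgette", "orange")

## PCRE accepts both named group syntaxes.
p.syntax <- regexpr("(?P<fruit>orange|apple)", subject, perl=TRUE)
short.syntax <- regexpr("(?<fruit>orange|apple)", subject, perl=TRUE)
attr(p.syntax, "capture.names")
same.start <- all(
  attr(p.syntax, "capture.start") == attr(short.syntax, "capture.start"))

## PCRE supports backreferences: \1 must repeat whatever the
## first group matched, so this detects a doubled letter.
has.doubled <- grepl("([a-z])\\1", subject, perl=TRUE)
```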

To tell namedCapture that you would like to use PCRE even if RE2 is available, use

options(namedCapture.engine="PCRE")
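
Since options() returns the previous values, the engine can also be switched for just one part of a script. The following sketch assumes that the option is read each time a matching function is called. It uses the named group syntax without the initial P, which needs PCRE.

```r
library(namedCapture)

food.vec <- c(one="apple", nope="courgette", two="orange")

## Switch to PCRE and remember the previous setting.
old.opt <- options(namedCapture.engine="PCRE")

## (?<fruit>...) without the initial P is PCRE syntax.
pcre.mat <- str_match_named(food.vec, "(?<fruit>orange|apple)")

## Restore whatever engine setting was in place before.
options(old.opt)
```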

Named capture regular expressions tutorial

For a more complete introduction to named capture regular expressions in R and Python, see https://github.com/tdhock/regex-tutorial

Related work

See my journal paper about namedCapture for a detailed discussion of R regex packages.

  • revector provides fast C code for a vector of named capture regular expressions (namedCapture and base R only provide functions for a single regular expression).
  • nc provides functions similar to namedCapture::*_variable but with additional support for the ICU regex engine.