An R package for named capture regular expressions

| tests    | https://travis-ci.org/tdhock/namedCapture.png?branch=master |
| coverage | https://coveralls.io/repos/tdhock/namedCapture/badge.svg?branch=master&service=github |

The namedCapture package provides user-friendly functions for extracting data tables from non-tabular text, using named capture regular expressions, which look like (?P<fruit>orange|apple). That pattern has one group named fruit that matches either orange or apple.

food.vec <- c(one="apple", nope="courgette", two="orange")
namedCapture::str_match_named(food.vec, "(?P<fruit>orange|apple)")
#>      fruit   
#> one  "apple" 
#> nope NA      
#> two  "orange"
namedCapture::str_match_variable(food.vec, fruit="orange|apple")
#>      fruit   
#> one  "apple" 
#> nope NA      
#> two  "orange"

Both results are character matrices with one row for each subject element (NA where the pattern does not match) and one named column for each capture group. The second version is the preferred syntax: it generates a named capture group for each named R argument. See also the newer nc package, which provides similar functionality and additionally supports the ICU regex engine (alongside the PCRE and RE2 engines).
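For comparison, the same extraction can be sketched in base R, which helps show what the matrices above contain. This is only an illustration using regexpr with perl=TRUE (the PCRE engine); it is not a description of how namedCapture is implemented, and the names match.info and base.result are made up for this example.

```r
food.vec <- c(one="apple", nope="courgette", two="orange")
# With perl=TRUE, regexpr accepts the (?P<name>...) syntax and reports
# the position of each capture group in attributes of its result.
match.info <- regexpr("(?P<fruit>orange|apple)", food.vec, perl=TRUE)
first <- attr(match.info, "capture.start")
last <- first + attr(match.info, "capture.length") - 1L
# The attribute columns are named after the capture groups.
fruit <- substring(food.vec, first[, "fruit"], last[, "fruit"])
# regexpr returns -1 for subjects that do not match; report those as NA.
fruit[as.integer(match.info) == -1L] <- NA
base.result <- matrix(
  fruit, ncol=1, dimnames=list(names(food.vec), "fruit"))
base.result
```

The bookkeeping with capture.start and capture.length is what the namedCapture functions save you from writing by hand.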

Installation

install.packages("namedCapture")
##OR:
if(!require(devtools))install.packages("devtools")
devtools::install_github("tdhock/namedCapture")

Usage overview

There are five main functions provided in namedCapture:

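|                             | Extract first match | Extract each match     |
|-----------------------------+---------------------+------------------------|
| chr subject + two arguments | str_match_named     | str_match_all_named    |
| chr subject + variable args | str_match_variable  | str_match_all_variable |
| df subject + variable args  | df_match_variable   | Not implemented        |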

The function prefix indicates the type of the first argument, which must contain the subject:

  • str_* means a character vector – each of these functions uses a single named capture regular expression to extract data from a character vector subject.
  • df_* means a data.frame – the df_match_variable function uses a different named capture regular expression to extract data from each of several specified character column subjects.
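As a rough illustration of the df_* prefix, the sketch below applies a different pattern to each of two character columns. The data frame job.df and its values are invented for this example, and the code assumes that namedCapture is installed.

```r
# Hypothetical input: two character columns, each with its own structure.
job.df <- data.frame(
  JobID=c("13937810_25", "13937810_26"),
  Elapsed=c("07:04:42", "00:15:03"),
  stringsAsFactors=FALSE)
match.df <- namedCapture::df_match_variable(
  job.df,
  # One named argument per column to parse; each value is a pattern list.
  JobID=list(
    job="[0-9]+", as.integer,
    "_",
    task="[0-9]+", as.integer),
  Elapsed=list(
    hours="[0-9]+", as.integer,
    ":",
    minutes="[0-9]+", as.integer,
    ":",
    seconds="[0-9]+", as.integer))
str(match.df)
```

Each argument after the first is named for the column it is matched against, and its value is a list written in the same pattern syntax that the str_*_variable functions accept.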

The function suffix indicates the type of the other arguments (after the first):

  • *_named means three arguments: subject, pattern, functions. The pattern should be a length-1 character vector that contains named capture groups, e.g. “(?P<groupName1>subPattern1)”. Read the Old three argument syntax vignette for more info.
  • *_variable means a variable number of arguments in which the pattern is specified using character strings, type conversion functions, and lists. Read the Recommended variable argument syntax vignette for more info about this powerful and user-friendly syntax, which is the suggested way of using namedCapture.
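To make the variable argument syntax concrete, here is a small sketch that mixes named patterns, an unnamed pattern, and a type conversion function. The subject vector pos.vec is invented for this example, the code assumes that namedCapture is installed, and the comment about how str_match_all_variable treats its subject reflects my reading of the package documentation rather than anything stated in this README.

```r
pos.vec <- c(
  first="chr10:213054000 chr2:5",
  second="chrM:111000",
  bad="no match here")
# Extract the first match from each subject element.
first.df <- namedCapture::str_match_variable(
  pos.vec,
  chrom="chr.*?",       # named argument: capture group called chrom
  ":",                  # unnamed string: pattern text outside any group
  chromStart="[0-9]+",  # named argument: capture group called chromStart
  as.integer)           # function: converts the group just before it
# Extract every match; the subject vector is assumed to be treated
# as a single multi-line text, so the result has one row per match.
all.df <- namedCapture::str_match_all_variable(
  pos.vec,
  chrom="chr.*?",
  ":",
  chromStart="[0-9]+",
  as.integer)
str(first.df)
str(all.df)
```

Because a conversion function is present, the results are data.frames rather than character matrices, and chromStart is an integer column.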

Additional vignettes:

Choice of regex engine

By default, namedCapture uses RE2 if the re2r package is available, and PCRE otherwise.

  • RE2 uses a polynomial time matching algorithm, so it can be faster than PCRE, which takes exponential time in the worst case.
  • RE2 does not support backreferences, but PCRE does.
  • RE2 only supports (?P<groupName>groupPattern) syntax for named groups, whereas PCRE also supports (?<groupName>groupPattern) syntax (without the initial P).
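The PCRE side of these differences can be checked with base R alone, since regexpr and grepl use PCRE when perl=TRUE. The sketch below is only an illustration (the subject vector and variable names are invented); it does not exercise RE2, which would require the re2r package.

```r
subject <- c("aa", "ab", "bb")
# A backreference: \1 must repeat whatever the first group matched.
# PCRE supports this; RE2 does not.
has.repeat <- grepl("(?P<ch>[ab])\\1", subject, perl=TRUE)
# PCRE accepts both spellings of a named group.
with.P <- regexpr("(?P<ch>[ab])", subject, perl=TRUE)
without.P <- regexpr("(?<ch>[ab])", subject, perl=TRUE)
same.names <- identical(
  attr(with.P, "capture.names"),
  attr(without.P, "capture.names"))
has.repeat
same.names
```

If a pattern needs to work with both engines, the (?P<groupName>groupPattern) spelling without backreferences is the safe subset.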

To tell namedCapture that you would like to use PCRE even if RE2 is available, use

options(namedCapture.engine="PCRE")
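If the engine should only be switched for part of a script, one possible approach is to save the previous option value and restore it afterwards. This sketch uses only base R options handling; the variable names are invented for the example.

```r
# options() returns the previous values of the options it changes.
old.opt <- options(namedCapture.engine="PCRE")
# Calls to namedCapture functions placed here would use PCRE.
current.engine <- getOption("namedCapture.engine")
# Restore whatever was set before (including "not set at all").
options(old.opt)
current.engine
```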

Named capture regular expressions tutorial

For a more complete introduction to named capture regular expressions in R and Python, see https://github.com/tdhock/regex-tutorial

Related work

See my journal paper about namedCapture for a detailed discussion of R regex packages.

  • revector provides fast C code for a vector of named capture regular expressions (namedCapture and base R only provide functions for a single regular expression).
  • nc provides functions similar to namedCapture::*_variable but with additional support for the ICU regex engine.
