nc: named capture
User-friendly functions for extracting a data table (row for each match, column for each group) from non-tabular text data using regular expressions, and for melting columns that match a regular expression.
food.vec <- c("granny smith apple", "blood orange")
nc::capture_first_vec(food.vec, type=".*", " ", fruit="orange|apple")
#>            type  fruit
#> 1: granny smith  apple
#> 2:        blood orange

Installation
install.packages("nc")
## or:
if(!require(devtools))install.packages("devtools")
devtools::install_github("tdhock/nc")

Usage overview
Watch the screencast tutorial videos!
The main functions provided in nc are:
| Subject | First match | All matches |
|---|---|---|
| Single string | NA | capture_all_str |
| Character vector | capture_first_vec | NA |
| Data frame chr cols | capture_first_df | NA |
| Data frame col names | capture_melt_single | NA |
| Data frame col names | capture_melt_multiple | NA |
- Vignette 1 discusses capture_first_vec and capture_first_df, which capture the first match in each of several subjects (character vector, data frame character columns).
- Vignette 2 discusses capture_all_str, which captures all matches in a single big multi-line subject string. The vignette also shows how to use capture_all_str on several different multi-line subject strings, using data.table by syntax.
- Vignette 3 discusses capture_melt_single and capture_melt_multiple, which match a regex to the column names of a wide data frame, then melt the matching columns. These functions are especially useful when more than one separate piece of information can be captured from each column name, e.g. the iris column names Petal.Width, Sepal.Width, etc. each have two pieces of information (flower part and measurement dimension).
- Vignette 4 shows comparisons with related R packages.
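As a rough illustration of the iris case mentioned for Vignette 3, the sketch below melts the four measurement columns with capture_melt_single. The group names part and dim are chosen here for illustration, and the output column name value is assumed to be the default.

```r
# Column names such as Sepal.Length match the pattern part, ".", dim.
# Species does not match the pattern, so it is kept as an id column.
iris.tall <- nc::capture_melt_single(
  iris,
  part = ".*",   # flower part, e.g. Sepal or Petal
  "[.]",         # literal dot separating the two pieces
  dim = ".*")    # measurement dimension, e.g. Length or Width
head(iris.tall)
```

Each of the 150 input rows contributes one output row per matching column, so the result should have 600 rows.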
nc::field for reducing repetition
The nc::field function can be used to avoid repetition when defining
patterns of the form variable: value. The example below shows three
(mostly) equivalent ways to write a regex that captures the text after
the colon and space; the captured text is stored in the “variable”
group or output column:
"variable: (?<variable>.*)"        # repetitive regex string
list("variable: ", variable=".*")  # repetitive nc R code
nc::field("variable", ": ", ".*")  # helper function avoids repetition

Another example:
"Alignment (?<Alignment>[0-9]+)"
list("Alignment ", Alignment="[0-9]+")
nc::field("Alignment", " ", "[0-9]+")

Another example:
"Chromosome:\t+(?<Chromosome>.*)"
list("Chromosome:\t+", Chromosome=".*")
nc::field("Chromosome", ":\t+", ".*")

nc::quantifier for fewer parentheses
Another helper function is nc::quantifier which makes patterns
easier to read by reducing the number of parentheses required to
define sub-patterns with quantifiers. For example all three patterns
below create an optional non-capturing group which contains a named
capture group:
"(?:-(?<chromEnd>[0-9]+))?"                  # regex string
list(list("-", chromEnd="[0-9]+"), "?")      # nc pattern using lists
nc::quantifier("-", chromEnd="[0-9]+", "?")  # quantifier helper function

Another example with a named capture group inside an optional non-capturing group:
"(?: (?<name>[^,}]+))?"
list(list(" ", name="[^,}]+"), "?")
nc::quantifier(" ", name="[^,}]+", "?")

Another example, for non-greedy zero or more lines. In this simple case the regex string literal may be easier to read:
"(?:.*\n)*?"
list(list(".*\n"), "*?")
nc::quantifier(".*\n", "*?")

nc::alternatives for simplified alternation
We also provide a helper function for defining regex patterns with alternation. The following three lines are equivalent.
"(?:(?<first>bar+)|(?<second>fo+))"
list(first="bar+", "|", second="fo+")
nc::alternatives(first="bar+", second="fo+")

Choice of regex engine
By default, nc uses PCRE. Other options include ICU and RE2.
To use a different engine for all subsequent calls, set the nc.engine option:
options(nc.engine="RE2")

Every function also has an engine argument, e.g.
nc::capture_first_vec(
"foo a\U0001F60E# bar",
before=".*?",
emoji="\\p{EMOJI_Presentation}",
after=".*",
engine="ICU")
#> before emoji after
#> 1 foo a 😎 # bar

Related work
Going forward I recommend using nc rather than namedCapture, which is an older package that provides a similar API:
| namedCapture | nc |
|---|---|
| str_match_variable | capture_first_vec |
| str_match_all_variable | capture_all_str |
| df_match_variable | capture_first_df |
For an overview of these functions, including usage explanations and a
detailed comparison with other R regex packages, see my R Journal paper
about namedCapture. The main differences between the functions in nc
and namedCapture are:
- Main nc functions all have the capture_ prefix for easy auto-completion.
- Internally nc uses un-named capture groups, whereas namedCapture uses named capture groups. This allows nc to support the ICU engine in addition to PCRE and RE2.
- Output in nc is always a data.table (namedCapture functions output either a character matrix or a data.frame).
- nc::capture_first_df does not prefix subject column names to capture group column names, whereas namedCapture::df_match_variable does.
- By default nc::capture_first_vec stops with an error if any subjects do not match, whereas namedCapture::str_match_variable returns NA/missing rows.
- Subject names and the capture group named name are not treated specially (in namedCapture they are used for rownames of output).
- nc::capture_all_str only supports capturing multiple matches in a single subject, whereas namedCapture::str_match_all_named supports multiple subjects. For multiple subjects, use DT[, nc::capture_all_str(subject), by] (see vignette 2 for more info).
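The DT[, nc::capture_all_str(subject), by] idiom from the last item can be sketched as follows. The table subject.dt, its columns id and subject, and the group names key and num are hypothetical names made up for this example.

```r
library(data.table)
# One row per subject string; id identifies each subject.
subject.dt <- data.table(
  id = c("a", "b"),
  subject = c("x=1 y=2", "z=3"))
# capture_all_str runs once per id, returning one row for each match.
match.dt <- subject.dt[, nc::capture_all_str(
  subject,
  key = "[a-z]",
  "=",
  num = "[0-9]+", as.integer),  # convert captured digits to integer
  by = id]
match.dt
```

The by column is carried through to the output, so each match can be traced back to the subject it came from.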
There are some new features in nc which are not present in
namedCapture:
- nc::capture_melt_single inputs a data.frame, tries to match a regex to its column names, then melts matching input columns to a single output column.
- nc::capture_melt_multiple inputs a data.frame, tries to match a regex to its column names, then melts matching input columns to several output columns of different types.
- Helper function nc::field is provided for defining patterns (with no repetition) that match subjects like variable=value, and create a column/group named variable. See vignette 2 for more info.
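As a small sketch of nc::field on a variable=value subject, the example below uses a made-up subject string with two fields; the field names JobID and State are assumptions chosen only for illustration.

```r
# Hypothetical subject with two variable=value fields.
subject <- "JobID=123 State=DONE"
field.dt <- nc::capture_first_vec(
  subject,
  nc::field("JobID", "=", "[0-9]+"),  # matches JobID=123, column JobID
  " ",
  nc::field("State", "=", "[A-Z]+"))  # matches State=DONE, column State
field.dt
```

Each field name is written once, and it serves as both the literal text to match and the output column name.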
These new features provide functionality similar to packages tidyr,
stats, data.table, reshape, reshape2, cdata, utils, etc. The main
difference is that nc::capture_melt_* support named capture regular
expressions with type conversion, which (1) makes it easier to
create/maintain a complex regex, and (2) results in less repetition in
user code. For a detailed comparison see my paper about nc.
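To make the type conversion point concrete, here is a minimal sketch using capture_melt_multiple on a made-up wide data frame. The data frame wide and the group name visit are hypothetical; the group named column is the one whose captured values become the output column names.

```r
# Hypothetical wide data: two measurements (height, weight) at two visits.
wide <- data.frame(
  id = c("a", "b"),
  height.1 = c(10.5, 11), weight.1 = c(1L, 2L),
  height.2 = c(12, 13.5), weight.2 = c(3L, 4L))
tall <- nc::capture_melt_multiple(
  wide,
  column = "height|weight",     # captured values name the output columns
  "[.]",
  visit = "[0-9]", as.integer)  # visit number converted to integer
tall
```

The result should have one row per id and visit, with separate height and weight columns, and a visit column that is already an integer rather than text.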
