nc: named capture

User-friendly functions for extracting a data table (row for each match, column for each group) from non-tabular text data using regular expressions, and for melting columns that match a regular expression.

food.vec <- c("granny smith apple", "blood orange")
nc::capture_first_vec(food.vec, type=".*", " ", fruit="orange|apple")
#>            type  fruit
#> 1: granny smith  apple
#> 2:        blood orange

Installation

install.packages("nc")
## or:
if(!require(devtools))install.packages("devtools")
devtools::install_github("tdhock/nc")

Usage overview

Watch the screencast tutorial videos!

The main functions provided in nc are:

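| Subject              | First match           | All matches     |
|----------------------+-----------------------+-----------------|
| Single string        | NA                    | capture_all_str |
| Character vector     | capture_first_vec     | NA              |
| Data frame chr cols  | capture_first_df      | NA              |
| Data frame col names | capture_melt_single   | NA              |
| Data frame col names | capture_melt_multiple | NA              |
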
  • Vignette 1 discusses capture_first_vec and capture_first_df, which capture the first match in each of several subjects (character vector, data frame character columns).
  • Vignette 2 discusses capture_all_str which captures all matches in a single big multi-line subject string. The vignette also shows how to use capture_all_str on several different multi-line subject strings, using data.table by syntax.
  • Vignette 3 discusses capture_melt_single and capture_melt_multiple, which match a regex to the column names of a wide data frame, then melt the matching columns. These functions are especially useful when more than one piece of information can be captured from each column name, e.g. the iris column names Petal.Width, Sepal.Width, etc. each have two pieces of information (flower part and measurement dimension).
  • Vignette 4 shows comparisons with related R packages.
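
As a minimal sketch of the column-name melting described for Vignette 3, the code below reshapes the iris data. It assumes the nc and data.table packages are installed; the group names part and dim, and the variable names iris.dt and iris.tall, are illustrative choices for this example, not names required by nc.

```r
## Illustrative sketch; iris.dt and iris.tall are names chosen for this example.
iris.dt <- data.table::data.table(iris)
iris.tall <- nc::capture_melt_single(
  iris.dt,
  part="Petal|Sepal", # first piece of information in each column name
  "[.]",              # literal dot separator, not captured
  dim="Length|Width") # second piece of information
## One row per (flower, matching column): 150 flowers * 4 columns.
print(nrow(iris.tall))
print(names(iris.tall))
```

The Species column does not match the regex, so it is kept as an identifier column in the output.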

nc::field for reducing repetition

The nc::field function can be used to avoid repetition when defining patterns of the form variable: value. The example below shows three (mostly) equivalent ways to write a regex that captures the text after the colon and space; the captured text is stored in the “variable” group or output column:

"variable: (?<variable>.*)"      #repetitive regex string
list("variable: ", variable=".*")#repetitive nc R code
nc::field("variable", ": ", ".*")#helper function avoids repetition

Another example:

"Alignment (?<Alignment>[0-9]+)"
list("Alignment ", Alignment="[0-9]+")
nc::field("Alignment", " ", "[0-9]+")

Another example:

"Chromosome:\t+(?<Chromosome>.*)"
list("Chromosome:\t+", Chromosome=".*")
nc::field("Chromosome", ":\t+", ".*")
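
The patterns above are only definitions; to show one in use, here is a small sketch that passes the Alignment pattern to nc::capture_first_vec. The subject strings and the variable names subject.vec and align.dt are made up for illustration.

```r
## Illustrative subjects; the text format is made up for this example.
subject.vec <- c("Alignment 42", "Alignment 7")
align.dt <- nc::capture_first_vec(
  subject.vec,
  nc::field("Alignment", " ", "[0-9]+"))
## align.dt has one row per subject and a character column named Alignment.
print(align.dt)
```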

nc::quantifier for fewer parentheses

Another helper function is nc::quantifier which makes patterns easier to read by reducing the number of parentheses required to define sub-patterns with quantifiers. For example all three patterns below create an optional non-capturing group which contains a named capture group:

"(?:-(?<chromEnd>[0-9]+))?"                #regex string
list(list("-", chromEnd="[0-9]+"), "?")    #nc pattern using lists
nc::quantifier("-", chromEnd="[0-9]+", "?")#quantifier helper function

Another example with a named capture group inside an optional non-capturing group:

"(?: (?<name>[^,}]+))?"
list(list(" ", name="[^,}]+"), "?")
nc::quantifier(" ", name="[^,}]+", "?")

Another example, for non-greedy zero or more lines. In this simple case the regex string literal may be easier to read:

"(?:.*\n)*?"
list(list(".*\n"), "*?")
nc::quantifier(".*\n", "*?")
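
A short sketch of the optional chromEnd group in use, assuming subjects that look like genomic positions (the subject strings and the names chr.pos.vec and pos.dt are illustrative). The second subject has no end position, yet it still matches because the group is optional.

```r
## Illustrative subjects: the second one has no "-end" part.
chr.pos.vec <- c("chr10:213054000-213055000", "chrM:111000")
pos.dt <- nc::capture_first_vec(
  chr.pos.vec,
  chrom="chr.*?",
  ":",
  chromStart="[0-9]+",
  nc::quantifier("-", chromEnd="[0-9]+", "?")) # optional end position
print(pos.dt)
```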

nc::alternatives for simplified alternation

We also provide a helper function for defining regex patterns with alternation. The following three lines are equivalent.

"(?:(?<first>bar+)|(?<second>fo+))"
list(first="bar+", "|", second="fo+")
nc::alternatives(first="bar+", second="fo+")
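
To see what the alternation produces, the sketch below applies the third form to two made-up subjects (subject.vec and alt.dt are illustrative names). Each subject matches only one alternative, so only one of the two output columns holds captured text in each row.

```r
## Illustrative subjects: the first matches "fo+", the second matches "bar+".
subject.vec <- c("foooo", "barrr")
alt.dt <- nc::capture_first_vec(
  subject.vec,
  nc::alternatives(first="bar+", second="fo+"))
print(alt.dt)
```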

Choice of regex engine

By default, nc uses PCRE. Other options include ICU and RE2.

To set the default engine for all nc functions, use the nc.engine option:

options(nc.engine="RE2")

Every function also has an engine argument, which overrides the option for a single call, e.g.

nc::capture_first_vec(
  "foo a\U0001F60E# bar",
  before=".*?",
  emoji="\\p{EMOJI_Presentation}",
  after=".*",
  engine="ICU")
#>   before emoji after
#> 1  foo a     😎 # bar

Related work

Going forward I recommend using nc rather than namedCapture, which is an older package that provides a similar API:

| namedCapture           | nc                |
|------------------------+-------------------|
| str_match_variable     | capture_first_vec |
| str_match_all_variable | capture_all_str   |
| df_match_variable      | capture_first_df  |

For a usage explanation of these functions and a detailed comparison with other R regex packages, see my R Journal paper about namedCapture. The main differences between the functions in nc and namedCapture are:

  • Main nc functions all have the capture_ prefix for easy auto-completion.
  • Internally nc uses un-named capture groups, whereas namedCapture uses named capture groups. This allows nc to support the ICU engine in addition to PCRE and RE2.
  • Output in nc is always a data.table (namedCapture functions output either a character matrix or a data.frame).
  • nc::capture_first_df does not prefix subject column names to capture group column names, whereas namedCapture::df_match_variable does.
  • By default, nc::capture_first_vec stops with an error if any subject does not match, whereas namedCapture::str_match_variable returns NA/missing rows.
  • Subject names and the capture group named name are not treated specially (in namedCapture they are used for rownames of output).
  • nc::capture_all_str only supports capturing multiple matches in a single subject, whereas namedCapture::str_match_all_named supports multiple subjects. For multiple subjects, use DT[, nc::capture_all_str(subject), by] (see vignette 2 for more info).
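
Two of the differences above can be sketched in code. The first part uses the nomatch.error argument of nc::capture_first_vec to get missing rows instead of an error; the second part uses data.table by syntax to run nc::capture_all_str once per subject. The subjects, the id column, and all variable names are made up for illustration.

```r
library(data.table)

## (1) Keep going when a subject does not match: nomatch.error=FALSE
## reports non-matching subjects as rows of missing values.
first.dt <- nc::capture_first_vec(
  c("chr1:100", "foo bar"),
  chrom="chr.*?",
  ":",
  pos="[0-9]+",
  nomatch.error=FALSE)

## (2) Several subjects with capture_all_str: one call per group via by.
subject.dt <- data.table(
  id=c("a", "b"),
  subject=c("x=1 y=2", "x=3"))
match.dt <- subject.dt[, nc::capture_all_str(
  subject,
  name="[a-z]",
  "=",
  value="[0-9]+"), by=id]
print(first.dt)
print(match.dt)
```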

There are some new features in nc which are not present in namedCapture:

  • nc::capture_melt_single inputs a data.frame, tries to match a regex to its column names, then melts matching input column names to a single output column.
  • nc::capture_melt_multiple inputs a data.frame, tries to match a regex to its column names, then melts matching input columns to several output columns of different types.
  • Helper function nc::field is provided for defining patterns (with no repetition) that match subjects like variable=value, and create a column/group named variable. See vignette 2 for more info.
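
As a rough sketch of nc::capture_melt_multiple, the code below melts the iris measurements into two output columns. It relies on the group named column, whose captured values become the names of the output columns; the variable names iris.dt and iris.parts and the added row index i are illustrative. In iris all measurement columns are numeric, so this example does not show output columns of different types.

```r
## Illustrative sketch; i is a row index added for this example.
iris.dt <- data.table::data.table(i=seq_len(nrow(iris)), iris)
iris.parts <- nc::capture_melt_multiple(
  iris.dt,
  column="Petal|Sepal", # each distinct captured value becomes an output column
  "[.]",
  dim="Length|Width")   # captured into a regular output column
## Two rows per flower, one for each measurement dimension.
print(names(iris.parts))
print(nrow(iris.parts))
```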

These new features provide functionality similar to packages tidyr, stats, data.table, reshape, reshape2, cdata, utils, etc. The main difference is that nc::capture_melt_* support named capture regular expressions with type conversion, which (1) makes it easier to create/maintain a complex regex, and (2) results in less repetition in user code. For a detailed comparison see my paper about nc.
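
A minimal sketch of the type conversion mentioned above, using a made-up wide table (wide.dt, tall.dt, and the day1/day2 column names are illustrative). A conversion function placed after a named group pattern is applied to the text captured by that group, so the day column comes out as integer rather than character.

```r
## Illustrative wide data: one column per day.
wide.dt <- data.table::data.table(
  id=c("a", "b"),
  day1=c(1.5, 2.5),
  day2=c(3.5, 4.5))
tall.dt <- nc::capture_melt_single(
  wide.dt,
  "day",                    # literal prefix, not captured
  day="[0-9]+", as.integer) # captured digits, converted to integer
print(tall.dt)
str(tall.dt$day)
```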
