named capture #16

Closed
wants to merge 2 commits into from
20 changes: 17 additions & 3 deletions R/match.r
@@ -27,9 +27,18 @@ str_match <- function(string, pattern) {

   if (length(string) == 0) return(character())

-  matcher <- re_call("regexec", string, pattern)
+  if(!is.perl(pattern)){
+    matcher <- re_call("regexec", string, pattern)
+  }else{
+    m <- re_call("regexpr", string, pattern)
+    matcher <- lapply(seq_along(m),function(i){
+      structure(c(m[i],attr(m,"capture.start")[i,]),
+                match.length=c(attr(m,"match.length")[i],attr(m,"capture.length")[i,]))
+    })
+  }

   matches <- regmatches(string, matcher)

   # Figure out how many groups there are and coerce into a matrix with
   # nmatches + 1 columns
   tmp <- str_replace_all(pattern, "\\\\\\(", "")
@@ -38,7 +47,12 @@ str_match <- function(string, pattern) {
   len <- vapply(matches, length, integer(1))
   matches[len == 0] <- rep(list(rep(NA_character_, n)), sum(len == 0))

-  do.call("rbind", matches)
+  match.matrix <- do.call("rbind", matches)
+  group.names <- names(matcher[[1]])
+  if(!is.null(group.names) && !all(group.names=="")){
+    colnames(match.matrix) <- group.names
+  }
+  match.matrix
 }

 #' Extract all matched groups from a string.
25 changes: 8 additions & 17 deletions README.md
@@ -1,19 +1,10 @@
-# stringr
+# stringr with named capture regular expressions

-Strings are not glamorous, high-profile components of R, but they do play a big role in many data cleaning and preparations tasks. R provides a solid set of string operations, but because they have grown organically over time, they can be inconsistent and a little hard to learn. Additionally, they lag behind the string operations in other programming languages, so that some things that are easy to do in languages like Ruby or Python are rather hard to do in R. The `stringr` package aims to remedy these problems by providing a clean, modern interface to common string operations.
+This package is an enhanced version of hadley/stringr. The only
+difference is that when you use `str_match_all(text, perl(regex))` or
+`str_match(text, perl(regex))`, the columns for the extracted
+subgroups will be named if `regex` is a named capture regular
+expression. It also takes advantage of the new fast C code available
+starting from R-2.14.

-More concretely, `stringr`:
-
-* Processes factors and characters in the same way.
-
-* Gives functions consistent names and arguments.
-
-* Simplifies string operations by eliminating options that you don't need
-  95% of the time.
-
-* Produces outputs than can easily be used as inputs. This includes ensuring
-  that missing inputs result in missing outputs, and zero length inputs
-  result in zero length outputs.
-
-* Completes R's string handling functions with useful functions from other
-  programming languages.
+http://sugiyama-www.cs.titech.ac.jp/~toby/papers/2011-08-16-directlabels-and-regular-expressions-for-useR-2011/2011-useR-named-capture-regexp.pdf
11 changes: 9 additions & 2 deletions inst/tests/test-match.r
@@ -56,6 +56,13 @@ test_that("multiple match works", {
"\\(([0-9]{3})\\) ([0-9]{3}) ([0-9]{4})")
   single_matches <- str_match(phones,
     "\\(([0-9]{3})\\) ([0-9]{3}) ([0-9]{4})")

   expect_that(multi_match[[1]], equals(single_matches))
-})
+})
+
+test_that("named capture works", {
+  phones_one <- str_c(phones, collapse = " ")
+  multi_match <- str_match_all(phones_one,
+    perl("\\((?<area>[0-9]{3})\\) ([0-9]{3}) ([0-9]{4})"))
+  expect_equal(colnames(multi_match[[1]])[2], "area")
+})
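
The `perl` branch added to `str_match` reads the `capture.start` and `capture.length` attributes that base R's `regexpr(..., perl = TRUE)` attaches to its result. The following is a minimal base-R sketch of that mechanism, independent of stringr and not the code in this pull request. The sample phone numbers and the variable names (`cap_start`, `cap_len`, `groups`) are illustrative assumptions.

```r
# Base R only. The inputs below are made-up sample data, not the test fixtures.
phones <- c("(219) 733 8965", "no number here", "(329) 293 8753")
pattern <- "\\((?<area>[0-9]{3})\\) ([0-9]{3}) ([0-9]{4})"

# With perl = TRUE, regexpr() records where each capture group matched.
m <- regexpr(pattern, phones, perl = TRUE)
cap_start <- attr(m, "capture.start")   # one row per input, one column per group
cap_len   <- attr(m, "capture.length")
cap_names <- attr(m, "capture.names")   # "" for unnamed groups

# Cut each group out of its input string; substring() recycles `phones`
# down the columns of the start/length matrices.
groups <- matrix(
  substring(phones, cap_start, cap_start + cap_len - 1),
  nrow = length(phones),
  dimnames = list(NULL, cap_names)
)

# Inputs with no match report a start of -1; mark those rows as missing.
groups[cap_start == -1] <- NA_character_

print(groups)
```

Only the first group is named in this pattern, so only the first column of `groups` carries a name (`area`); the other column names are empty strings. That is the same situation the `!all(group.names=="")` check in the diff guards against.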