add strcapture() for extracting tokens into a data.frame via capture expressions

git-svn-id: https://svn.r-project.org/R/trunk@70406 00db46b3-68df-0310-9c12-caf00c1e9a41
lawrence committed Mar 31, 2016
1 parent 9eb80de commit 7ee1667
Showing 4 changed files with 101 additions and 1 deletion.
5 changes: 5 additions & 0 deletions doc/NEWS.Rd
@@ -22,6 +22,11 @@
method that would be dispatched. A number of internal utilities
were added to support this, most notably
\code{utils::isS3stdGeneric()}. Based on a patch by Gabe Becker.

\item Add \code{utils::strcapture()}. Given a character vector and
a regular expression containing capture expressions,
\code{strcapture()} will extract the captured tokens into a
tabular data structure, typically a data.frame.
}
}
\subsection{DEPRECATED AND DEFUNCT}{
2 changes: 1 addition & 1 deletion src/library/utils/NAMESPACE
@@ -35,7 +35,7 @@ export("?", .DollarNames, .S3methods, .romans, CRAN.packages, Rprof,
read.delim, read.delim2, read.fwf, read.fortran, read.socket,
read.table, recover, relist, remove.packages, removeSource,
rtags, savehistory, select.list, sessionInfo, setBreakpoint,
-setRepositories, stack, str, strOptions, summaryRprof,
+setRepositories, stack, str, strcapture, strOptions, summaryRprof,
suppressForeignCheck, tail, tail.matrix, tar, timestamp,
toBibtex, toLatex, type.convert, undebugcall, unstack, untar, unzip,
update.packageStatus, update.packages, upgrade, url.show, vi,
42 changes: 42 additions & 0 deletions src/library/utils/R/strcapture.R
@@ -0,0 +1,42 @@
## File src/library/utils/R/strcapture.R
## Part of the R package, https://www.R-project.org
##
## Copyright (C) 1995-2016 The R Core Team
##
## This program is free software; you can redistribute it and/or modify
## it under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 2 of the License, or
## (at your option) any later version.
##
## This program is distributed in the hope that it will be useful,
## but WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
## GNU General Public License for more details.
##
## A copy of the GNU General Public License is available at
## https://www.R-project.org/Licenses/

strcapture <- function(pattern, x, proto, perl = FALSE, useBytes = FALSE) {
    m <- regexec(pattern, x, perl=perl, useBytes=useBytes)
    str <- regmatches(x, m)
    ntokens <- length(proto) + 1L
    if (!all(lengths(str) == ntokens)) {
        stop("number of matches does not always match ncol(proto)")
    }
    mat <- matrix(as.character(unlist(str)), ncol=ntokens,
                  byrow=TRUE)[,-1L,drop=FALSE]
    ans <- lapply(seq_along(proto), function(i) {
        if (isS4(proto[[i]])) {
            methods::as(mat[,i], class(proto[[i]]))
        } else {
            fun <- match.fun(paste0("as.", class(proto[[i]])))
            fun(mat[,i])
        }
    })
    names(ans) <- names(proto)
    if (isS4(proto)) {
        methods::as(ans, class(proto))
    } else {
        as.data.frame(ans)
    }
}
53 changes: 53 additions & 0 deletions src/library/utils/man/strcapture.Rd
@@ -0,0 +1,53 @@
\name{strcapture}
\alias{strcapture}
\title{
Capture string tokens into a data.frame
}
\description{
Given a character vector and a regular expression containing capture
expressions, \code{strcapture} will extract the captured tokens into a
tabular data structure, such as a data.frame, the type and structure of
which is specified by a prototype object. Every input string is assumed
to match the pattern and to yield the same number of tokens, one per
column of the prototype; otherwise an error is raised.
}
\usage{
strcapture(pattern, x, proto, perl = FALSE, useBytes = FALSE)
}
\arguments{
\item{pattern}{
The regular expression with the capture expressions.
}
\item{x}{
A character vector in which to capture the tokens.
}
\item{proto}{
A \code{data.frame} or S4 object that behaves like one. See details.
}
\item{perl,useBytes}{
Arguments passed to \code{\link{regexec}}.
}
}
\details{
The \code{proto} argument is typically a \code{data.frame}, with a
column corresponding to each capture expression, in order. The
captured character vector is coerced to the type of the column, and
the column names are carried over to the return value. Any data in the
prototype are ignored. See the examples.
}
\value{
A tabular data structure of the same type as \code{proto}, so
typically a \code{data.frame}, containing a column for each capture
expression. The column types and names are inherited from
\code{proto}.
}
\seealso{
\code{\link{regexec}} and \code{\link{regmatches}} for related
low-level utilities.
}
\examples{
x <- "chr1:1-1000"
pattern <- "(.*?):([[:digit:]]+)-([[:digit:]]+)"
proto <- data.frame(chr=character(), start=integer(), end=integer())
strcapture(pattern, x, proto)
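
## Illustrative sketch only, not part of the original commit: the same
## call applied to several strings, with a numeric prototype column.
## The names kv, kv_pattern, kv_proto and kv_res are hypothetical.
## Every element of kv must match kv_pattern, because strcapture()
## stops when the number of captured tokens differs from ncol(proto).
kv <- c("alpha=1.5", "beta=20")
kv_pattern <- "([[:alpha:]]+)=([0-9.]+)"
kv_proto <- data.frame(key = character(), value = numeric())
kv_res <- strcapture(kv_pattern, kv, kv_proto)
## The 'value' column was coerced with as.numeric(), so arithmetic works
kv_res$value * 2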
}
\keyword{utilities}
