Skip to content

re2 regular expressions

Toby Dylan Hocking edited this page Feb 12, 2016 · 16 revisions

Background

Regular expressions are powerful tools for parsing loosely structured text data. Russ Cox’s ”Regular Expression Matching Can Be Simple And Fast” explains that due to backreference support, several common regular expression engines have a time complexity which is exponential in pattern size (including PCRE which is used by R). One way to achieve polynomial time complexity is to drop backreference support and use the re2 C++ library.

Related work

The figures below show timings of these libraries for matching the pattern a?a?a?aaa against the subject aaa (N=3), for several different values of N. Source.

Coding project: library(re2)

The GSOC student should write an R package interface to the re2 C++ code, providing the R community with the first regular expression package with both named capture and polynomial time complexity.

R function library named capture complexity unicode
regexpr(perl=FALSE) TRE no polynomial ???
stringi::stri_match() ICU no exponential yes
regexpr(perl=TRUE) PCRE yes exponential ???
str_match_re2() re2 yes polynomial ???
namedCapture function re2 function finds returns
str_match_named str_match_re2 first match character matrix
str_match_all_named str_match_all_re2 several matches list of character matrices
  • Copy and modify the tests in the namedCapture package.
  • Write an R interface that compiles the pattern and then uses the RE2::PartialMatch C++ function.
  • Write a vignette with extensive benchmarks, covering also real-world text data samples and popular regexes like email, www search, etc.

Expected impact

In general, the re2 package will provide a fast and secure way to extract data from text files using regex. More specifically, the polynomial time complexity of the re2 package will be useful for writing secure web applications that accept user-provided regular expressions. For example, say that you write a shiny app that accepts a user-provided regex. If R passes this regex to existing regex libraries such as PCRE or ICU, then this shiny app is vulnerable to a denial-of-service attack. If the user’s regex is malicious, then it could cause the R on the shiny server to stay busy computing a match for a long time (worst case exponential time complexity). In contrast, the re2 library has a worst case polynomial time complexity so would not suffer from such attacks.

Mentors

Please get in touch with Toby Dylan Hocking <tdhock5@gmail.com> and Marek Gagolewski <gagolewski@rexamine.com> after completing at least one of the tests below.

Tests

Do one or several — doing more hard tests makes you more likely to be selected.

Solutions of tests

Students, please post a link to your test results here.

Clone this wiki locally