Skip to content

tdhock/namedCapture-article

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

51 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Title: Comparing namedCapture with other R packages for regular expressions

Abstract: Regular expressions are powerful tools for manipulating non-tabular textual data. For many tasks (visualization, machine learning, etc), tables of numbers must be extracted from such data before processing by other R functions. We present the R package namedCapture, which facilitates such tasks by providing a new user-friendly syntax for defining regular expressions in R code. We begin by describing the history of regular expressions and their usage in R. We then describe the new features of the namedCapture package, and provide detailed comparisons with related R packages (rex, stringr, stringi, tidyr, rematch2, re2r).

Article accepted at R Journal. Errata which have been corrected in this local copy:

  • The publised version states “Like TRE, the RE2 library guarantees linear time complexity” which is misleading. Actually TRE does NOT guarantee linear time complexity if the pattern has backreferences. In fact, this is true of any regex engine; because RE2 does not support backreferences, it can guarantee time complexity which is linear in the subject and pattern size.

Main source file to edit is hocking.Rnw and Makefile takes care of creating submission.zip

6 Mar 2019

R-core member Tomas Kalibera has updated R so that it will use linear time algorithms for both substring and gregexpr, starting with R-3.6.0.

figure-timings-all-email.png

27 Feb 2019

  • Some minor fixes, move Figure with C libraries to beginning.
  • main file to edit is now hocking.Rnw and Makefile takes care of creating submission.zip

25 Feb 2019

Submitted. New source file hocking.tex generated manually from HOCKING-namedCapture.Rnw in order to work with R journal submission requirements, via RJwrapper.tex

20 Feb 2019

figure-trackDb-pkgs.R produces the following figure, which shows that applying the gregexpr_perl patch recovers the linear time complexity for the gregexpr function, but the other functions/packages (rematch2, namedCapture, rex) which use it to extract captured sub-patterns are still at quadratic time complexity:

figure-trackDb-pkgs.png

The culprit is the substring function in R, which has time complexity that is quadratic as the number of substrings and the subject text size increases. The code figure-substring-bug.R produces the following figure which shows the quadratic time complexity of the base R substring function, and the linear time complexity of the stringi::stri_sub function:

figure-substring-bug.png

I suspect the issue is again related to repeated calls to strlen in C.

19 Feb 2019

linear-time-gregexpr-perl.patch moves strlen before the while loop, and reduces the time complexity from quadratic (before the patch)

figure-trackDb-Ronly-slow.png

to linear (after applying the patch to R-3.5.1)

figure-trackDb-Ronly-fast.png

Code to reproduce the figure above: figure-trackDb-Ronly.R

15 Feb 2019

Found https://stackoverflow.com/questions/31216299/r-faster-gregexpr-for-very-large-strings and http://r.789695.n4.nabble.com/strsplit-perl-TRUE-gregexpr-perl-TRUE-very-slow-for-long-strings-td4727902.html maybe give PCRE_NO_UTF8_CHECK flag? https://stat.ethz.ch/pipermail/r-help/2008-October/178451.html

Tried R-2.14 – first version with PCRE named capture support – still slow.

TODO try converting pcre_demo.c to R pkg.

14 Feb 2019

Investigation about R-PCRE slowness.

  • Does not seem to be an issue with PCRE itself – I compiled command line programs that use PCRE1 and PCRE2 (without R) which have asymptotic timings similar to R-TRE.

figure-trackDb-PCRE-R-1-2.png

  • One difference is that pcre_demo.c uses non-zero options, whereas grep.c in R source code always uses options=0. However this does not seem to be the issue because options are only used when a zero-length match is encountered, which does not happen for my data.
  • Another difference is that gregexpr_perl in grep.c in R source code uses a buffer of 1024 matches to begin with, and then doubles it if it is not big enough, whereas the command line pcre_demo.c program uses no such buffers and memory reallocation. However that does not seem to be the issue because gregexpr_Regexc (TRE) in grep.c in R source code does the same thing, and is not slow.
  • A difference between PCRE and TRE in R is that PCRE calls extract_match_and_groups whereas TRE does not. However that does not seem to be the issue, because groups are extracted in the command line program.
  • A difference is that pcre_demo does not study whereas R does. That does not seem to be the issue (changed to NULL in R grep.c but still slow).
  • Is pcre_exec being called with different starts? No…
  • It doesn’t Have to do with the if(capture_count) block because it is slow even when there are no capture groups.
  • What is the difference bewteen regexpr_perl and gregexpr_perl? There does not seem to be any speed issue with regexpr_perl.
  • TODO run a profiler to find out where the slow part is.

13 Feb 2019

Observed that PCRE is much slower than other regex engines for finding all matches in a long string of text. Is this a bug?

Using both R and Python interfaces, I observed that matching is much slower for PCRE:

  • PCRE from R = median 0.57 seconds.
  • TRE from R = median 0.03 seconds.
> stats.dt[subject.size==max(subject.size)]
         expr subject.size    median        q25        q75
1:  perl=TRUE     31622.78 0.5658699 0.54575069 0.57090807
2: perl=FALSE     31622.78 0.0334391 0.03343219 0.03345432
> 

Asymptotic analysis:

figure-trackDb-gregexpr.png

In Python:

tdhock@recycled:~/projects/namedCapture-article(master*)$ python trackDb.py trackDb.txt 
get_list('re2')
         3302 function calls in 0.034 seconds
...
get_list('re')
         3549 function calls (3541 primitive calls) in 0.019 seconds
...
get_list('pcre')
         9789 function calls in 3.809 seconds
...
tdhock@recycled:~/projects/namedCapture-article(master*)$ 

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published