hol-regex

This repo contains a simple, formally verified implementation of a regex engine in Standard ML. The formal verification is done using HOL, with code auto-generated from HOL definitions using the EmitML library. This project was done as part of an interactive theorem proving course at KTH, and is unlikely to be of much practical use.

The regex engine is based on the very nice paper A Play on Regular Expressions by Sebastian Fischer, Frank Huch and Thomas Wilke, published at ICFP 2010 (http://www-ps.informatik.uni-kiel.de/~sebf/pub/regexp-play.html).

Build instructions

Ensure that you have HOL and PolyML installed. Then cd into the ML/ directory, and type Holmake.

This should emit code for a regex implementation, regexML.{sig,sml}, and compile a test program test (from test.sml) that uses this as a library. The test program works roughly like grep: ./test regex < file prints all the lines of file that match the regex (fully; not just contain a matching substring).

Alternatively, ./test without arguments runs a small self-test on the library.

Performance

bench.py does some unscientific performance benchmarking of the three different algorithms implemented by the library, and the built-in regex matcher (regexpMatch.sml) that comes with HOL. Results vary quite a bit from run to run because of garbage collection and similar non-determinism, but still give some hint about real-world performance.

On my 2015 era laptop I get the following results:

Benchmark: |aa*aa| vs |a^n|
---------------
n \ matcher |       slow |     faster |    fastest |    builtin
10                0.0007      0.00019      0.00018       0.0011
20                  0.32      0.00019      0.00018       0.0011
30                     -      0.00018      0.00055       0.0011
100                    -      0.00027      0.00026       0.0011
1000                   -      0.00045      0.00047       0.0018
10000                  -       0.0024       0.0035       0.0014
100000                 -        0.034        0.051       0.0026
1000000                -         0.43         0.48        0.021
10000000               -            -            -         0.18
---------------

(Short regex, long string. slow is terrible; faster and fastest are about equally good. The builtin matcher is 10x faster. The three last all scale linearly.)

Benchmark: |a (a|b)^n a| vs |a^(n+2)|
---------------
n \ matcher |       slow |     faster |    fastest |    builtin
10               0.00014       0.0013      0.00017      0.00086
30                0.0003      0.00019      0.00029       0.0039
100                0.014      0.00076       0.0012        0.064
300                 0.78       0.0053       0.0096          1.2
1000                  29        0.071          0.1           43
3000                   -          0.5         0.77            -
10000                  -          4.8          8.4            -
---------------

(Long regex, long string. slow gets lucky and does surprisingly well. faster and fastest both cope pretty well, scaling quadratically with n. fastest is slightly slower, likely due to generating more garbage. builtin does reasonably up to a point, but is not well optimized for large regexes.)

Benchmark: |(a*)^n a| vs |a^10|
---------------
n \ matcher |       slow |     faster |    fastest |    builtin
10               0.00048      0.00028      0.00028       0.0032
100               0.0011      0.00028      0.00074         0.52
1000               0.012       0.0054       0.0018            -
10000               0.12         0.32         0.06            -
30000               0.67          2.7         0.37            -
---------------

(Long regex, short string. slow again gets lucky. fastest does asymptotically better than faster due to caching (linear vs quadratic in n). As seen before, builtin does not cope well with long regexes.)

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
ML		ML
.gitignore		.gitignore
README.md		README.md
regexEmit.sml		regexEmit.sml
regexScript.sml		regexScript.sml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ML

ML

.gitignore

.gitignore

README.md

README.md

regexEmit.sml

regexEmit.sml

regexScript.sml

regexScript.sml

Repository files navigation

hol-regex

Build instructions

Performance

About

Releases

Packages

Languages

simonlindholm/hol-regex

Folders and files

Latest commit

History

Repository files navigation

hol-regex

Build instructions

Performance

About

Resources

Stars

Watchers

Forks

Languages