Demo: Creating Regular Expression Matcher's using the haskell bindings to LLVM
Haskell
Switch branches/tags
Nothing to show
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
Text/RegExp
tests
Grep.hs
Makefile
README
TestWeightedRegexp.hs

README

JIT-compiling code to match regular expressions
===============================================

This is a small demo, showing how to generate regular expression matchers using LLVM.
It is inspired by the great Weighted Regular Expressions paper (http://sebfisch.github.com/haskell-regexp/).

There is no documentation so far, only copies of mailing list communication.

Introduction (28.7.2010)
------------------------
Taking a quick look at the PyPy blog post on JIT code generation for regular expressions, I thought it would be fun to implement a generator using the excellent LLVM bindings for haskell.

The attached prototype does not scale for larger regexp, is mostly untested and probably unoptimized, but is quite a bit faster than ruby and python's re, with the code generation core only spanning 25 lines.

Here is a biased (because the LLVM stuff needs to use bytestrings) and completely unrepresentative comparison for the "even number of c's" regular expressions:

> ruby -e 'print("accbccacbc" * 10000000)' > test.in

> # Using weighted-regexp-0.2.0.0 and the RE from the paper
> time ./TestWeightedRegexp < test.in # using the RE from the paper
       58.34 real        56.65 user         1.10 sys

> time ruby -e 'gets =~ /\A(?:[ab]*c[ab]*c)*[ab]*\Z/' < test.in
        7.42 real         6.23 user         1.03 sys

> time ./Grep '([ab]*c[ab]*c)*[ab]*' < test.in
        0.96 real         0.79 user         0.16 sys

For large regular expressions, the generated bitcode serves as a good stress test for LLVM's backend ;) 

Rough Functionality (30.7.2010)
-------------------------------

The current implementation works as follows:

It generates an LLVM function which matches a bytestring against a given  regular expression. The state of the 'automaton' consists of one bit for
each leaf of the regexp AST, corresponding to the marks in the article's figure. It then generates straight-line code updating the state given an input character (generateRegexpCode and matchCharSet in Text.RegExp.LLVM).

For example, for matching '.a' , the generated code looks like this in a simplified pseudo code notation:

> ... next ~ first character matched
> ... ch   ~ input character
>   next1 = bitmask[0]              -- was '.' marked ?
>   bitmask[0] = next               -- mark '.' if initial
>   next2 = bitmask[1]              -- was 'a' marked ?
>   bitmask[1] = ch == 'a' && next1 -- mark 'a' if '.' was marked                                          
                                    -- and input is 'a'

At the end of the string, code is generated to check whether the automaton is in a final state (genFinalStateCheck). In the example above, this corresponds to

>   final = bitmask[1]

The rest is either LLVM boilerplate (regexMatcher) or adaptions from weighed-regexp. Additionally, I've adapted the Parser.y from weighted-regexp, but some things (e.g. character classes like \w) are not implemented.

Generating one big basic block for the whole regular expressions does not work well when there are more than a few thousand nodes in the AST. Using functions for large regular expressions and loop for repititions would be one solution. The other problem is that the matcher only operates on Word8s at the moment.

Trying it out (if you have llvm+haskell bindings) is also easy (do not cabal-easy):

> $ git clone git@github.com:visq/llvm-regexp.git && cd llvm-regexp
> $ make
> $ ./Grep '.*spotlight.*' < /etc/passwd 
< Line 45 matches