Explore replacing Zeek's regex engine #426

rsmmr · 2019-06-20T15:06:53Z

Zeek is still using a custom regex engine that comes with some limitations. Let's revisit whether we could replace it with a more standard, external library. The answer is not clear unfortunately because we have some requirements that historically other engines had trouble meeting. Here's a stab at collecting our requirements for choosing a new engine:

DFA-based (for performance)
Support for extracting capture groups
Stream API: Feed input in arbitrary chunks, with each result indicating whether the regexp (1) has already matched; (2) has not yet matched, but might still match with more input; or (3) can never match anymore.
Parallel matching for a set of regexes, with the result telling us which one matched. Parallel matching must scale in terms of memory and CPU to at least a similar size as our current engine (note that the current engine builds its DFAs incrementally on demand to save memory).
Option to prefer the earliest match over the longest (strictly speaking this is not a Bro requirement right now, but Spicy needs it, and I'd rather use the same library for both).

We have multiple use cases for regexes in Zeek, and need these features in various combinations (e.g., streaming set matching preferring the earliest match).
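
To illustrate the parallel-matching requirement with on-demand DFA construction, here is a minimal sketch. It is not Zeek's engine: it only handles anchored literal patterns, and the names `LazySetMatcher`, `_step`, and `match` are hypothetical. It shows the two properties asked for above: the result says which pattern matched, and DFA transitions are only computed (and cached) when input actually reaches them.

```python
class LazySetMatcher:
    """Matches a set of anchored literal patterns in parallel, building DFA
    states on demand and caching each transition the first time it is used.
    Literal-only patterns are an assumption made to keep the sketch short."""

    def __init__(self, patterns):
        self.patterns = list(patterns)
        # A DFA state is the set of (pattern index, chars matched so far).
        self.start = frozenset((i, 0) for i in range(len(self.patterns)))
        self.cache = {}  # (state, char) -> next state, filled lazily

    def _step(self, state, ch):
        key = (state, ch)
        if key not in self.cache:
            # Compute this transition only now that the input needs it.
            self.cache[key] = frozenset(
                (i, n + 1)
                for i, n in state
                if n < len(self.patterns[i]) and self.patterns[i][n] == ch
            )
        return self.cache[key]

    def match(self, data):
        """Return the sorted indices of the patterns that match first
        (earliest match), or [] if no pattern can match."""
        state = self.start
        for ch in data:
            state = self._step(state, ch)
            done = sorted(i for i, n in state if n == len(self.patterns[i]))
            if done:
                return done
            if not state:
                return []  # dead state: no pattern can match anymore
        return []


m = LazySetMatcher(["GET ", "POST ", "PUT "])
first = m.match("POST /x")   # index 1 ("POST ") matches
second = m.match("PUT /y")   # index 2 ("PUT ") matches, reusing the cached 'P' step
```

Only the transitions touched by the two inputs end up in `m.cache`; states for input that never arrives are never built, which is the memory-saving idea described in the list.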

When I looked around years ago, I didn't see any library doing all this, which is why I ended up writing a custom one that HILTI/Spicy is using. However, that one remains prototypish and has its own challenges; I wouldn't recommend using it for Zeek (nor, eventually, for Spicy).

0xxon · 2019-06-20T21:36:52Z

https://rust-leipzig.github.io/regex/2017/03/28/comparison-of-regex-engines/ might be helpful (even if a bit dated).

timwoj · 2019-07-12T00:38:24Z

Currently Hyperscan is winning in the hunt for a library, supporting everything that @rsmmr requested above plus being able to build regexes at compile time. I'm building a table of supported features at https://docs.google.com/spreadsheets/d/1B5kc1PgvOHF621AtrrbQ1LUPiaVBBWjx1Dc8_leTcmk/edit?usp=sharing

timwoj · 2019-07-12T16:39:41Z

@rsmmr I have a couple of questions about the list above. For capture groups, do you want the ability to back-reference or just extract the captures from the resulting match? For earliest match vs longest match, are you just referring to lazy/non-greedy mode?

rsmmr · 2019-07-12T17:23:45Z

> For capture groups, do you want the ability to back-reference or just extract the captures from the resulting match?

Just extract.

> For earliest match vs longest match, are you just referring to lazy/non-greedy mode?

Yes, I think so. I have one more criterion: syntax compatibility with our current engine. I'm sure we won't find a perfect match, but we should try hard to generally avoid breaking people's scripts when switching out the engine.


rsmmr · 2019-07-12T17:24:59Z

I thought TRE is DFA-based and does support capture groups (iirc, the combination of the two was one of the reasons it was developed in the first place)

timwoj · 2019-07-12T22:33:01Z

> I thought TRE is DFA-based and does support capture groups (iirc, the combination of the two was one of the reasons it was developed in the first place)

I don't see any documentation for DFA beyond a single comment in https://github.com/laurikari/tre/blob/6fb7206b935b35814c5078c20046dbe065435363/lib/tre-match-backtrack.c, and that's just for backtracking. I assume if they support backtracking, they must support capture groups as well.

I looked again and did find support for lazy mode as well. If the submatch stuff in regmatch_t means it supports capture groups, the only thing missing from TRE is parallel matching.

timwoj · 2019-07-12T22:36:59Z

That leaves three libraries all missing one thing on the spreadsheet, and all of them missing something different. Is there anything on that list that could be skipped in favor of something else?

In regards to syntax compatibility, we could probably write a simple wrapper over whatever library we end up choosing.

timwoj · 2019-07-12T22:40:24Z

A combination of Hyperscan and PCRE might be able to get us all the way there, as described in intel/hyperscan#17.

timwoj · 2019-07-12T22:45:14Z

I added another library called Chimera, which is basically Hyperscan+PCRE. It supports capture groups which Hyperscan didn't, but it loses support for Hyperscan's streaming mode.

rsmmr · 2019-07-19T18:00:07Z

Found the source for TRE's tagged DFA approach: http://laurikari.net/ville/regex-submatch.pdf

This is btw the approach that HILTI's prototype regex library takes as well (but I'm not suggesting to use that, hoping to get rid of it :)

timwoj · 2019-07-29T19:27:29Z

Updated the chart to note that RE2 does have a stream API through the Consume and FindAndConsume methods. It's now the only engine on the chart to support everything we wanted.

rsmmr · 2019-07-31T04:06:32Z

That's good news!

rsmmr · 2019-07-31T17:20:16Z

I looked more at the Consume methods and don't think they actually do what we need: they match the same regex repeatedly against subsequent instances of input; they don't match a single regex continuously against a single stream of bytes. Unless I'm missing something?

@sethhall just had a good thought though: We should see if RE2 comes with an example of how they would implement something like "grep" against unbounded input: if one pipes 100G of data into an RE2-based "grep", what would the code look like to make that work?

timwoj · 2019-07-31T18:06:28Z

From https://github.com/google/re2/blob/master/re2/re2.h:

// The "Consume" operation may be useful if you want to repeatedly
// match regular expressions at the front of a string and skip over
// them as they match. This requires use of the "StringPiece" type,
// which represents a sub-range of a real string.

The way I read that it sounds like it matches starting at a point in a string, then skips ahead of that match and continues matching from the next starting point.

> We should see if RE2 comes with an example of how they would implement something like "grep" against unbounded input: if one pipes 100G of data into an RE2-based "grep", what would the code look like to make that work?

Can you link me to an example of how we would do that in the current Zeek code? It looks like RE always takes a bounded string (in that we call strlen on it) for calls to Match, and RE2 does the same thing with the StringPiece constructor. I'm always happy to write up some POC code for this.

rsmmr · 2019-07-31T22:10:51Z

> The way I read that it sounds like it matches starting at a point in a string, then skips ahead of that match and continues matching from the next starting point.

Yes, but it resets the regexp state to match from the expression's beginning again.

Their example is parsing lines of var = value: Consume gets each line, and each time it matches the full RE "(\\w+) = (\\d+)\n". What we would need is an API where the regex is matched only once, but we can feed in a sequence of successive data pieces like var, =, value.
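
To make the distinction concrete, here is a small sketch that uses Python's `re` module as a stand-in. It is not RE2, and `consume_all` is a hypothetical helper that only mimics the Consume pattern of matching at the front, skipping past the match, and restarting the regex from its beginning.

```python
import re

LINE = re.compile(r"(\w+) = (\d+)\n")


def consume_all(text: str):
    """Consume-style loop: match at the front, skip past the match, then
    restart the regex from its beginning for the next match."""
    pos, found = 0, []
    while True:
        m = LINE.match(text, pos)
        if m is None:
            return found, pos
        found.append((m.group(1), m.group(2)))
        pos = m.end()


# Whole input available at once: each line is matched by a fresh run.
pairs, _ = consume_all("a = 1\nb = 2\n")

# One line arriving in pieces: every chunk is matched in isolation, so the
# regex never sees "var = 42\n" as one unit and nothing matches.
chunks = ["var", " = ", "42\n"]
per_chunk = [consume_all(c)[0] for c in chunks]

# Only buffering all chunks first recovers the match, which is exactly
# what a true streaming API would let us avoid.
buffered, _ = consume_all("".join(chunks))
```

Here `pairs` holds both key/value pairs, `per_chunk` is three empty lists, and `buffered` holds the single pair from the reassembled line.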

> Can you link me to an example of how we would do that in the current Zeek code?

Look at the RE_Match_State class in RE.h. That keeps the current DFA state while a regex is matched against a sequence of input chunks that're being fed in through the Match() method.
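
A minimal sketch of that idea, which is not Zeek's actual RE_Match_State code: the matcher object holds the current DFA state between calls, and each call reports one of the three outcomes from the requirements list. The DFA is hand-built for the toy anchored pattern `ab+c`, and the names `StreamMatcher`, `Status`, and `feed` are hypothetical.

```python
from enum import Enum


class Status(Enum):
    MATCHED = 1      # the regex has already matched
    NEED_MORE = 2    # no match yet, but more input could still produce one
    FAILED = 3       # no further input can ever produce a match


# Hand-built DFA for the anchored toy pattern "ab+c" (an assumption for
# illustration; a real engine would compile this from the regex).
TRANSITIONS = {
    (0, "a"): 1,
    (1, "b"): 2,
    (2, "b"): 2,
    (2, "c"): 3,
}
ACCEPTING = {3}


class StreamMatcher:
    """Keeps the current DFA state across calls to feed()."""

    def __init__(self):
        self.state = 0
        self.status = Status.NEED_MORE

    def feed(self, chunk: str) -> Status:
        for ch in chunk:
            if self.status is not Status.NEED_MORE:
                break  # outcome already decided; ignore remaining input
            nxt = TRANSITIONS.get((self.state, ch))
            if nxt is None:
                self.status = Status.FAILED
            else:
                self.state = nxt
                if nxt in ACCEPTING:
                    self.status = Status.MATCHED  # stop at the earliest match
        return self.status


m = StreamMatcher()
steps = [m.feed("a"), m.feed("bb"), m.feed("cx")]
```

The three calls return NEED_MORE, NEED_MORE, and MATCHED: the regex is matched once, across chunk boundaries, without buffering earlier input.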

sethhall · 2019-08-01T03:21:07Z

Yeah, I think you're right Robin. It looks like match state isn't separated out in the RE2 API at all. As Tim pointed out earlier though, Hyperscan does that and appears to have an incremental/streaming API that would work for us. Perhaps that's the direction we should go?

data-man · 2019-08-01T14:11:27Z

Yet another library: PIRE, GPL 3.

sethhall · 2019-08-01T14:49:13Z

Unfortunately PIRE is distributed under an incompatible license.

timwoj · 2019-08-01T16:32:46Z

> Yeah, I think you're right Robin. It looks like match state isn't separated out in the RE2 API at all.

That was my misunderstanding of what we actually needed from streaming matches. Now that I'm more clear on it, I completely agree that it doesn't meet the need. I marked that column back to red and made a note on the sheet about it.

timwoj · 2019-08-01T21:22:49Z

I opened a GitHub issue on the RE2 repo asking whether they support a stream API like the one we need.

0xxon · 2019-08-01T21:53:24Z

Issue link: google/re2#213

0xxon · 2019-08-01T22:10:35Z

I just found google/re2#204, which indicates that re2 does not support capture groups in DFA mode. If I do not misread this, I assume that re2 is dead for our use in any case - since I think we wanted to use it in DFA mode.

timwoj · 2019-08-02T16:18:12Z

The response from the re2 project is that they do not support a streaming mode at all. I agree that re2 is dead for our purposes.

timwoj · 2019-08-07T21:53:49Z

I think we've reached an impasse on this issue where we can't find a library that sufficiently meets our needs. If no one has any other comments, we can shelve this issue and revisit it at some point in the future when there may be a library that does cover what we need.

EDIT: After talking to @rsmmr, we decided to shelve this issue for the time being. We don't want to implement something using a library that doesn't support everything we want, and then have something new come along that does and require a bunch of rework. I'll leave the issue open so it doesn't get lost, and we can revisit it later.

timwoj · 2020-08-27T15:50:53Z

I revisited all of the libraries in the google doc again the other day and nothing has changed with any of them in regards to what we need. I'm going to close this for now.

data-man · 2020-08-27T16:10:01Z

I think RE-flex (documentation) is more suitable than others.

timwoj · 2020-08-27T17:05:08Z

@data-man Thanks, I hadn't seen that one yet. It looks like it supports most of the features, but depending on which matcher you use it may lose one or two of them. For example, if you use the PCRE matcher, you get DFA but not capture groups.

jasonlue · 2020-11-09T11:17:56Z

@timwoj

My company (icebrg.io, now part of Gigamon) actually replaced the Zeek/Bro regex engine with Hyperscan for MIME detections for a few years... I worked on related projects.

The huge downside of Hyperscan is that in practice it doesn't support capture groups. For each match, you only get the end offset of the match (to), not its start (from). (It supports a flag to enable returning from, but the flag makes a lot of regular expressions fail to compile.) In some applications we have to use Hyperscan to find the match end, and then use other packages to figure out the capture groups.
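
The two-pass workaround described above could look roughly like the following sketch. Python's `re` stands in for both engines, so this is not Hyperscan's API: `find_end_offsets` merely simulates a scanner that reports only end offsets, and `recover_groups` is a hypothetical second pass with a capturing engine.

```python
import re

PATTERN = re.compile(r"(\w+)=(\d+)")


def find_end_offsets(data: str):
    """Stand-in for a scanner that reports only where each match ends."""
    return [m.end() for m in PATTERN.finditer(data)]


def recover_groups(data: str, end: int):
    """Second pass with a capturing engine: try start offsets from the left
    until the pattern matches exactly up to the reported end offset."""
    for start in range(end):
        m = PATTERN.fullmatch(data, start, end)
        if m is not None:
            return start, m.groups()
    return None


data = "x=1 key=42;"
results = [recover_groups(data, end) for end in find_end_offsets(data)]
```

For this input, `results` contains the start offset and captured groups for `x=1` and `key=42`. The second pass retries from every candidate start offset, so it adds cost on top of the fast scan; that overhead is the price of not getting the match start from the first engine.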

-jason

rsmmr · 2020-11-10T10:03:06Z

Yeah, lack of capture groups is one downside of Hyperscan. Also, I think their optimizations at least are Intel-CPU specific; are there other portability issues?

istiak101 · 2022-06-15T19:22:01Z

> Yeah, lack of capture groups is one downside of Hyperscan. Also, I think their optimizations at least are Intel-CPU specific; are there other portability issues?

Now there is VectorScan.
https://github.com/VectorCamp/vectorscan

timwoj · 2022-06-15T19:33:03Z

> Now there is VectorScan. https://github.com/VectorCamp/vectorscan

Oh nice, there's a version of Chimera there too that supports capture groups. It's possible that VectorScan supports everything we want it to.

rsmmr added the Area: Regex, Complexity: Substantial, Type: Project, and Type: Enhancement labels Jun 20, 2019

timwoj self-assigned this Jun 20, 2019

timwoj removed their assignment Aug 8, 2019

rsmmr added the Priority: Blocked label Aug 13, 2020

rsmmr mentioned this issue Aug 13, 2020

Zeek regular expression consumes a lot of memory #450

Closed

timwoj closed this as completed Aug 27, 2020

zeek locked and limited conversation to collaborators Aug 16, 2022

bbannier converted this issue into discussion #2342 Aug 16, 2022
