Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explore replacing Zeek's regex engine #426

Closed
rsmmr opened this issue Jun 20, 2019 · 31 comments
Closed

Explore replacing Zeek's regex engine #426

rsmmr opened this issue Jun 20, 2019 · 31 comments
Labels
Area: Regex Complexity: Substantial For the stout of heart. Priority: Blocked Type: Enhancement Type: Project A self-contained project — for example an intern project, a tech evaluation, or prototyping

Comments

@rsmmr
Copy link
Member

rsmmr commented Jun 20, 2019

Zeek is still using a custom regex engine that comes with some limitations. Let's revisit whether we could replace it with a more standard, external library. The answer is not clear unfortunately because we have some requirements that historically other engines had trouble meeting. Here's a stab at collecting our requirements for choosing a new engine:

  • DFA-based (for performace)
  • Support for extracting capture groups
  • Stream API: Feed input in arbitrary chunks, with each result indicating whether regexp has (1) already matched; (2) has not yet matched, but might still match with more input; or (3) will never match anymore.
  • Parallel matching for a set of regexes, with the result telling us which one matched. Parallel matching must scale in terms of memory and CPU to at least a similar size than our current engine (note that that one builds the DFAs incrementally on-demand to save memory).
  • Option to prefer the earliest match over the longest (strictly speaking this is not a Bro requirement right now, but Spicy needs it; and I'd rather use the same library for both).

We have multiple use cases for regexes in Zeek, and need these features in various combinations (e.g., streaming set matching preferring the earliest match).

When I looked around years ago, I didn't see any library doing all this; which is why I ended up writing a custom one that HILTI/Spicy is using. However, that one remains prototypish, and with its own challenges; wouldn't recommend using that for Zeek (nor eventually for Spicy).

@rsmmr rsmmr added Area: Regex Complexity: Substantial For the stout of heart. Type: Project A self-contained project — for example an intern project, a tech evaluation, or prototyping Type: Enhancement labels Jun 20, 2019
@timwoj timwoj self-assigned this Jun 20, 2019
@0xxon
Copy link
Member

0xxon commented Jun 20, 2019

https://rust-leipzig.github.io/regex/2017/03/28/comparison-of-regex-engines/ might be helpful (even if a bit dated).

@timwoj
Copy link
Contributor

timwoj commented Jul 12, 2019

Currently Hyperscan is winning in the hunt for a library, supporting everything that @rsmmr requested above plus being able to build regexes at compile time. I'm building a table of supported features at https://docs.google.com/spreadsheets/d/1B5kc1PgvOHF621AtrrbQ1LUPiaVBBWjx1Dc8_leTcmk/edit?usp=sharing

@timwoj
Copy link
Contributor

timwoj commented Jul 12, 2019

@rsmmr I have a couple of questions in the list above. For capture groups, do you want the ability to back-reference or just extract the captures from the resulting match? For earliest match vs longest match, are you just referring to lazy/non-greedy mode?

@rsmmr
Copy link
Member Author

rsmmr commented Jul 12, 2019 via email

@rsmmr
Copy link
Member Author

rsmmr commented Jul 12, 2019

I thought TRE is DFA-based and does support capture groups (iirc, the combination of the two was one of the reasons it was developed in the first place)

@timwoj
Copy link
Contributor

timwoj commented Jul 12, 2019

I thought TRE is DFA-based and does support capture groups (iirc, the combination of the two was one of the reasons it was developed in the first place)

I don't see any documentation for DFA beyond a single comment in https://github.com/laurikari/tre/blob/6fb7206b935b35814c5078c20046dbe065435363/lib/tre-match-backtrack.c, and that's just for backtracking. I assume if they support backtracking, they must support capture groups as well.

I looked again and did find support for lazy mode as well. If the submatch stuff in regmatch_t means it supports capture groups, the only thing missing from TRE is parallel matching.

@timwoj
Copy link
Contributor

timwoj commented Jul 12, 2019

That leaves three libraries all missing one thing on the spreadsheet, and all of them missing something different. Is there anything on that list that could be skipped in favor of something else?

In regards to syntax compatibility, we could probably write a simple wrapper over whatever library we end up choosing.

@timwoj
Copy link
Contributor

timwoj commented Jul 12, 2019

A combination of Hyperscan and PCRE might be able to get us all the way there, as described in intel/hyperscan#17.

@timwoj
Copy link
Contributor

timwoj commented Jul 12, 2019

I added another library called Chimera, which is basically Hyperscan+PCRE. It supports capture groups which Hyperscan didn't, but it loses support for Hyperscan's streaming mode.

@rsmmr
Copy link
Member Author

rsmmr commented Jul 19, 2019

Found the source for TRE's tagged DFA approach: http://laurikari.net/ville/regex-submatch.pdf

This is btw the approach that HILTI's prototype regex library takes as well (but I'm not suggesting to use that, hoping to get rid of it :)

@timwoj
Copy link
Contributor

timwoj commented Jul 29, 2019

Updated the chart to note that RE2 does have a stream API through the Consume and FindAndConsume methods. It's now the only engine on the chart to support everything we wanted.

@rsmmr
Copy link
Member Author

rsmmr commented Jul 31, 2019

That's good news!

@rsmmr
Copy link
Member Author

rsmmr commented Jul 31, 2019

I looked more at the Consume methods and don't think they actually do what we need: they match the same regex repeatedly against subsequent instances of input; they don't match a single regex continuously against a single stream of bytes. Unless I'm missing something?

@sethhall just had a good thought though: We should see if RE2 comes with an example of how they would implement something like "grep" against unbounded input: if one pipes 100G of data into an RE2-based "grep", how would the code look like to make that work?

@timwoj
Copy link
Contributor

timwoj commented Jul 31, 2019

From https://github.com/google/re2/blob/master/re2/re2.h:

// The "Consume" operation may be useful if you want to repeatedly
// match regular expressions at the front of a string and skip over
// them as they match. This requires use of the "StringPiece" type,
// which represents a sub-range of a real string.

The way I read that it sounds like it matches starting at a point in a string, then skips ahead of that match and continues matching from the next starting point.

We should see if RE2 comes with an example of how they would implement something like "grep" against unbounded input: if one pipes 100G of data into an RE2-based "grep", how would the code look like to make that work?

Can you link me to an example of how we would do that in the current Zeek code? It looks like RE always takes a bounded string (in that we call strlen on it) for calls to Match, and RE2 does the same thing with the StringPiece constructor. I'm always happy to write up some POC code for this.

@rsmmr
Copy link
Member Author

rsmmr commented Jul 31, 2019

The way I read that it sounds like it matches starting at a point in a string, then skips ahead of that match and continues matching from the next starting point.

Yes, but it resets the regexp state to match from the expression's beginning again.

Their example is parsing lines of var = value: Consume gets each line, and each time it matches the full RE "(\\w+) = (\\d+)\n". What we would need is an API where the regex is matched only once, but we can feed in a sequence of successive data piece like var, =, value.

Can you link me to an example of how we would do that in the current Zeek code?

Look at the RE_Match_State class in RE.h. That keeps the current DFA state while a regex is matched against a sequence of input chunks that're being fed in through the Match() method.

@sethhall
Copy link
Member

sethhall commented Aug 1, 2019

Yeah, I think you're right Robin. It looks like match state isn't separated out in the RE2 api at all. As Tim pointed out earlier though, Hyperscan does that and appears to have an incremental/streaming api that would work for us. Perhaps that's the direction we should go?

@data-man
Copy link

data-man commented Aug 1, 2019

Yet another library: PIRE, GPL 3.

@sethhall
Copy link
Member

sethhall commented Aug 1, 2019

Unfortunately PIRE is distributed under an incompatible license.

@timwoj
Copy link
Contributor

timwoj commented Aug 1, 2019

Yeah, I think you're right Robin. It looks like match state isn't separated out in the RE2 api at all.

That was my misunderstanding of what we actually needed from streaming matches. Now that I'm more clear on it, I completely agree that it doesn't meet the need. I marked that column back to red and made a note on the sheet about it.

@timwoj
Copy link
Contributor

timwoj commented Aug 1, 2019

I opened a GHI on the RE2 repo about whether or not they support a stream API like we need.

@0xxon
Copy link
Member

0xxon commented Aug 1, 2019

Issue link: google/re2#213

@0xxon
Copy link
Member

0xxon commented Aug 1, 2019

I just found google/re2#204, which indicates that re2 does not support capture groups in DFA mode. If I do not misread this, I assume that re2 is dead for our use in any case - since I think we wanted to use it in DFA mode.

@timwoj
Copy link
Contributor

timwoj commented Aug 2, 2019

The response from the re2 project is that they do not support a streaming mode at all. I agree that re2 is dead for our purposes.

@timwoj
Copy link
Contributor

timwoj commented Aug 7, 2019

I think we've reached an impasse on this issue where we can't find a library that sufficiently meets our needs. If no one has any other comments, we can shelve this issue and revisit it some point in the future when there may be a library that does cover what we need.

EDIT: After talking to @rsmmr, we decided to shelve this issue for the time being. We don't want to implement something using a library that doesn't support everything we want, and then have something new come along that does and require a bunch of rework. I'll leave the issue open so it doesn't get lost, and we can revisit it later.

@timwoj timwoj removed their assignment Aug 8, 2019
@timwoj
Copy link
Contributor

timwoj commented Aug 27, 2020

I revisited all of the libraries in the google doc again the other day and nothing has changed with any of them in regards to what we need. I'm going to close this for now.

@timwoj timwoj closed this as completed Aug 27, 2020
@data-man
Copy link

I think RE-flex (documentation) is more suitable than others.

@timwoj
Copy link
Contributor

timwoj commented Aug 27, 2020

@data-man Thanks, I hadn't seen that one yet. It looks like it supports most of the features, but depending on what matcher you use it may lose one or two of them. For example if you use PCRE matcher, you get DFA but not capture groups.

@jasonlue
Copy link
Contributor

jasonlue commented Nov 9, 2020

@timwoj

My company (icebrg.io, now part of gigamon) actually replaced zeek/bro regex with hyperscan for MIME detections for a few years... I worked on related projects.

The huge downside of hyperscan, is that it in reality doesn't support capture groups. For each match, you only get the end of the match (to), not the from of the match. (It supports a flag to enable returning from, but the flag makes a lot of regex expressions fail to compile.) In some applications we have to use hyperscan to find the match end, and then use other packages to figure out the capture group.

-jason

@rsmmr
Copy link
Member Author

rsmmr commented Nov 10, 2020

Yeah, lack of capture groups is one downside of hyperscan. Also, I think their optimizations at least are Intel-CPU specific, are there other portability issues?

@istiak101
Copy link

Yeah, lack of capture groups is one downside of hyperscan. Also, I think their optimizations at least are Intel-CPU specific, are there other portability issues?

Now there is VectorScan.
https://github.com/VectorCamp/vectorscan

@timwoj
Copy link
Contributor

timwoj commented Jun 15, 2022

Now there is VectorScan. https://github.com/VectorCamp/vectorscan

Oh nice, there's a version of Chimera there too that supports capture groups. It's possible that VectorScan supports everything we want it to.

@zeek zeek locked and limited conversation to collaborators Aug 16, 2022
@bbannier bbannier converted this issue into discussion #2342 Aug 16, 2022

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
Area: Regex Complexity: Substantial For the stout of heart. Priority: Blocked Type: Enhancement Type: Project A self-contained project — for example an intern project, a tech evaluation, or prototyping
Projects
None yet
Development

No branches or pull requests

7 participants