-
Notifications
You must be signed in to change notification settings - Fork 1.2k
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Explore replacing Zeek's regex engine #426
Comments
https://rust-leipzig.github.io/regex/2017/03/28/comparison-of-regex-engines/ might be helpful (even if a bit dated). |
Currently Hyperscan is winning in the hunt for a library, supporting everything that @rsmmr requested above plus being able to build regexes at compile time. I'm building a table of supported features at https://docs.google.com/spreadsheets/d/1B5kc1PgvOHF621AtrrbQ1LUPiaVBBWjx1Dc8_leTcmk/edit?usp=sharing |
@rsmmr I have a couple of questions in the list above. For capture groups, do you want the ability to back-reference or just extract the captures from the resulting match? For earliest match vs longest match, are you just referring to lazy/non-greedy mode? |
For capture groups, do you want the ability to back-reference or
just extract the captures from the resulting match?
Just extract.
For earliest match vs longest match, are you just referring to lazy/non-greedy mode?
Yes, I think so.
I have one more criteria: Syntax compability with our current engine.
I'm sure we won't find a perfect match, but we should try hard to
generally avoid breaking people's scripts when switching out the engine.
…--
Robin Sommer * Corelight, Inc. * robin@corelight.com * www.corelight.com
|
I thought TRE is DFA-based and does support capture groups (iirc, the combination of the two was one of the reasons it was developed in the first place) |
I don't see any documentation for DFA beyond a single comment in https://github.com/laurikari/tre/blob/6fb7206b935b35814c5078c20046dbe065435363/lib/tre-match-backtrack.c, and that's just for backtracking. I assume if they support backtracking, they must support capture groups as well. I looked again and did find support for lazy mode as well. If the submatch stuff in |
That leaves three libraries all missing one thing on the spreadsheet, and all of them missing something different. Is there anything on that list that could be skipped in favor of something else? In regards to syntax compatibility, we could probably write a simple wrapper over whatever library we end up choosing. |
A combination of Hyperscan and PCRE might be able to get us all the way there, as described in intel/hyperscan#17. |
I added another library called Chimera, which is basically Hyperscan+PCRE. It supports capture groups which Hyperscan didn't, but it loses support for Hyperscan's streaming mode. |
Found the source for TRE's tagged DFA approach: http://laurikari.net/ville/regex-submatch.pdf This is btw the approach that HILTI's prototype regex library takes as well (but I'm not suggesting to use that, hoping to get rid of it :) |
Updated the chart to note that RE2 does have a stream API through the |
That's good news! |
I looked more at the @sethhall just had a good thought though: We should see if RE2 comes with an example of how they would implement something like "grep" against unbounded input: if one pipes 100G of data into an RE2-based "grep", how would the code look like to make that work? |
From https://github.com/google/re2/blob/master/re2/re2.h:
The way I read that it sounds like it matches starting at a point in a string, then skips ahead of that match and continues matching from the next starting point.
Can you link me to an example of how we would do that in the current Zeek code? It looks like RE always takes a bounded string (in that we call |
Yes, but it resets the regexp state to match from the expression's beginning again. Their example is parsing lines of
Look at the |
Yeah, I think you're right Robin. It looks like match state isn't separated out in the RE2 api at all. As Tim pointed out earlier though, Hyperscan does that and appears to have an incremental/streaming api that would work for us. Perhaps that's the direction we should go? |
Yet another library: PIRE, GPL 3. |
Unfortunately PIRE is distributed under an incompatible license. |
That was my misunderstanding of what we actually needed from streaming matches. Now that I'm more clear on it, I completely agree that it doesn't meet the need. I marked that column back to red and made a note on the sheet about it. |
I opened a GHI on the RE2 repo about whether or not they support a stream API like we need. |
Issue link: google/re2#213 |
I just found google/re2#204, which indicates that re2 does not support capture groups in DFA mode. If I do not misread this, I assume that re2 is dead for our use in any case - since I think we wanted to use it in DFA mode. |
The response from the re2 project is that they do not support a streaming mode at all. I agree that re2 is dead for our purposes. |
I think we've reached an impasse on this issue where we can't find a library that sufficiently meets our needs. If no one has any other comments, we can shelve this issue and revisit it some point in the future when there may be a library that does cover what we need. EDIT: After talking to @rsmmr, we decided to shelve this issue for the time being. We don't want to implement something using a library that doesn't support everything we want, and then have something new come along that does and require a bunch of rework. I'll leave the issue open so it doesn't get lost, and we can revisit it later. |
I revisited all of the libraries in the google doc again the other day and nothing has changed with any of them in regards to what we need. I'm going to close this for now. |
I think RE-flex (documentation) is more suitable than others. |
@data-man Thanks, I hadn't seen that one yet. It looks like it supports most of the features, but depending on what matcher you use it may lose one or two of them. For example if you use PCRE matcher, you get DFA but not capture groups. |
My company (icebrg.io, now part of gigamon) actually replaced zeek/bro regex with hyperscan for MIME detections for a few years... I worked on related projects. The huge downside of hyperscan, is that it in reality doesn't support capture groups. For each match, you only get the end of the match (to), not the from of the match. (It supports a flag to enable returning from, but the flag makes a lot of regex expressions fail to compile.) In some applications we have to use hyperscan to find the match end, and then use other packages to figure out the capture group. -jason |
Yeah, lack of capture groups is one downside of hyperscan. Also, I think their optimizations at least are Intel-CPU specific, are there other portability issues? |
Now there is VectorScan. |
Oh nice, there's a version of Chimera there too that supports capture groups. It's possible that VectorScan supports everything we want it to. |
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
Zeek is still using a custom regex engine that comes with some limitations. Let's revisit whether we could replace it with a more standard, external library. The answer is not clear unfortunately because we have some requirements that historically other engines had trouble meeting. Here's a stab at collecting our requirements for choosing a new engine:
We have multiple use cases for regexes in Zeek, and need these features in various combinations (e.g., streaming set matching preferring the earliest match).
When I looked around years ago, I didn't see any library doing all this; which is why I ended up writing a custom one that HILTI/Spicy is using. However, that one remains prototypish, and with its own challenges; wouldn't recommend using that for Zeek (nor eventually for Spicy).
The text was updated successfully, but these errors were encountered: