Explore replacing Zeek's regex engine #426

Open
rsmmr opened this issue Jun 20, 2019 · 9 comments

3 participants
@rsmmr
Member

commented Jun 20, 2019

Zeek is still using a custom regex engine that comes with some limitations. Let's revisit whether we could replace it with a more standard, external library. The answer is not clear unfortunately because we have some requirements that historically other engines had trouble meeting. Here's a stab at collecting our requirements for choosing a new engine:

  • DFA-based (for performance)
  • Support for extracting capture groups
  • Stream API: Feed input in arbitrary chunks, with each result indicating whether the regexp (1) has already matched; (2) has not yet matched, but might still match with more input; or (3) can no longer match.
  • Parallel matching for a set of regexes, with the result telling us which one matched. Parallel matching must scale, in memory and CPU, to regex sets at least as large as those our current engine handles (note that the current engine builds its DFAs incrementally, on demand, to save memory).
  • Option to prefer the earliest match over the longest (strictly speaking this is not a Bro requirement right now, but Spicy needs it; and I'd rather use the same library for both).
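To make the three-state stream API concrete, here is a minimal sketch in Python. It is not Zeek's engine or any library's API: the `Status` and `StreamMatcher` names, the hand-built transition table for the pattern `ab+c`, and the anchored-at-start behavior are all assumptions chosen for illustration. The table is assumed to be trimmed so that a missing edge means no match is possible anymore.

```python
from enum import Enum


class Status(Enum):
    MATCHED = 1    # the pattern has already matched
    NEED_MORE = 2  # no match yet, but more input could still produce one
    FAILED = 3     # no further input can produce a match


class StreamMatcher:
    """Anchored streaming matcher over a hand-built DFA (illustrative only)."""

    def __init__(self, transitions, start, accepting):
        self.transitions = transitions  # {(state, char): next_state}
        self.accepting = accepting
        self.state = start
        self.status = Status.MATCHED if start in accepting else Status.NEED_MORE

    def feed(self, chunk):
        for ch in chunk:
            if self.status is not Status.NEED_MORE:
                break  # earliest-match policy: the outcome is already final
            nxt = self.transitions.get((self.state, ch))
            if nxt is None:
                # Assumption: the table is trimmed, so a missing edge is a dead end.
                self.status = Status.FAILED
            else:
                self.state = nxt
                if nxt in self.accepting:
                    self.status = Status.MATCHED
        return self.status


def new_matcher():
    # Hypothetical DFA for the pattern ab+c, written out by hand.
    transitions = {(0, "a"): 1, (1, "b"): 2, (2, "b"): 2, (2, "c"): 3}
    return StreamMatcher(transitions, start=0, accepting={3})
```

Feeding `"a"`, then `"bb"`, then `"cd"` to one matcher reports `NEED_MORE`, `NEED_MORE`, and `MATCHED`; feeding `"ax"` to a fresh matcher reports `FAILED`, and the matcher stays in that state for any later chunk.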

We have multiple use cases for regexes in Zeek, and need these features in various combinations (e.g., streaming set matching preferring the earliest match).
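As a rough illustration of one such combination (streaming set matching that stops at the earliest match, with DFA states built on demand), consider the sketch below. To keep it short it handles only non-empty literal strings rather than full regexes, and it is anchored at the start of the stream; `LazySetMatcher` and its fields are hypothetical names, not part of Zeek or any library.

```python
class LazySetMatcher:
    """Streaming matcher for a set of literal patterns (illustrative only).

    A DFA state is the frozenset of (pattern_index, position) pairs that are
    still alive. Transitions are computed the first time they are needed and
    cached, so only the part of the DFA that the input visits is ever built.
    """

    def __init__(self, patterns):
        self.patterns = patterns  # assumed to be non-empty strings
        self.state = frozenset((i, 0) for i in range(len(patterns)))
        self.cache = {}           # (dfa_state, char) -> dfa_state
        self.matched = set()      # indices of patterns that matched

    def _step(self, state, ch):
        key = (state, ch)
        if key not in self.cache:
            self.cache[key] = frozenset(
                (i, pos + 1)
                for (i, pos) in state
                if pos < len(self.patterns[i]) and self.patterns[i][pos] == ch
            )
        return self.cache[key]

    def feed(self, chunk):
        for ch in chunk:
            if self.matched or not self.state:
                break  # earliest match already found, or no pattern can match
            self.state = self._step(self.state, ch)
            self.matched = {
                i for (i, pos) in self.state if pos == len(self.patterns[i])
            }
        return self.matched
```

With the patterns `["GET ", "POST ", "GETX"]`, feeding `"GE"` returns an empty set and feeding `"T "` afterwards returns `{0}`, after building only the four transitions the input actually exercised. An empty `state` corresponds to the "can no longer match" outcome.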

When I looked around years ago, I didn't see any library doing all this; which is why I ended up writing a custom one that HILTI/Spicy is using. However, that one remains prototypish, and with its own challenges; wouldn't recommend using that for Zeek (nor eventually for Spicy).

@0xxon

Member

commented Jun 20, 2019

@timwoj

Contributor

commented Jul 12, 2019

Currently Hyperscan is winning in the hunt for a library, supporting everything that @rsmmr requested above plus being able to build regexes at compile time. I'm building a table of supported features at https://docs.google.com/spreadsheets/d/1B5kc1PgvOHF621AtrrbQ1LUPiaVBBWjx1Dc8_leTcmk/edit?usp=sharing

@timwoj

Contributor

commented Jul 12, 2019

@rsmmr I have a couple of questions in the list above. For capture groups, do you want the ability to back-reference or just extract the captures from the resulting match? For earliest match vs longest match, are you just referring to lazy/non-greedy mode?
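For readers unfamiliar with the distinctions these questions draw, the snippet below shows them using Python's standard `re` module, which is a backtracking engine and is used here only as a convenient stand-in. The variable names are illustrative, and the snippet does not settle which semantics Zeek or Spicy need.

```python
import re

# Extracting captures: read the groups out of a successful match.
m = re.match(r"(\w+)=(\w+)", "key=value")
captures = m.groups()

# Back-reference: \1 must repeat whatever group 1 matched.
backref = re.compile(r"(\w+) \1")
repeated = backref.match("hello hello") is not None
different = backref.match("hello world") is not None

# Greedy vs. lazy (non-greedy) quantifiers on the same input.
greedy = re.match(r"a+", "aaab").group()
lazy = re.match(r"a+?", "aaab").group()
```

Here `captures` is `('key', 'value')`, `repeated` is true while `different` is false, and `greedy` is `"aaa"` while `lazy` is `"a"`. Note that a lazy quantifier is a per-operator preference inside one pattern, which may or may not coincide with an engine-wide "report the earliest match" policy.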

@rsmmr

Member Author

commented Jul 12, 2019

@rsmmr

Member Author

commented Jul 12, 2019

I thought TRE is DFA-based and does support capture groups (iirc, the combination of the two was one of the reasons it was developed in the first place)

@timwoj

Contributor

commented Jul 12, 2019

> I thought TRE is DFA-based and does support capture groups (iirc, the combination of the two was one of the reasons it was developed in the first place)

I don't see any documentation for DFA beyond a single comment in https://github.com/laurikari/tre/blob/6fb7206b935b35814c5078c20046dbe065435363/lib/tre-match-backtrack.c, and that's just for backtracking. I assume if they support backtracking, they must support capture groups as well.

I looked again and did find support for lazy mode as well. If the submatch stuff in regmatch_t means it supports capture groups, the only thing missing from TRE is parallel matching.

@timwoj

Contributor

commented Jul 12, 2019

That leaves three libraries all missing one thing on the spreadsheet, and all of them missing something different. Is there anything on that list that could be skipped in favor of something else?

With regard to syntax compatibility, we could probably write a simple wrapper over whichever library we end up choosing.

@timwoj

Contributor

commented Jul 12, 2019

A combination of Hyperscan and PCRE might be able to get us all the way there, as described in intel/hyperscan#17.

@timwoj

Contributor

commented Jul 12, 2019

I added another library called Chimera, which is basically Hyperscan+PCRE. It supports capture groups, which Hyperscan doesn't, but it loses support for Hyperscan's streaming mode.
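One way to picture a combination of a fast matcher and a capture-capable engine is a two-stage scan: a cheap first pass decides which rules are candidates, and a second pass confirms each candidate and extracts its captures. The sketch below only illustrates that general shape under stated assumptions. Python's `re` stands in for both engines, the `RULES` table and `scan` function are hypothetical, and none of this is Chimera's or Hyperscan's actual API.

```python
import re

# Hypothetical rules: (capture-free prefilter pattern, full pattern with captures).
RULES = {
    "http_get": (r"GET \S+ HTTP/", r"GET (\S+) HTTP/(\d\.\d)"),
    "ssh_banner": (r"SSH-", r"SSH-(\d\.\d+)-(\S+)"),
}

COMPILED = {
    name: (re.compile(pre), re.compile(full))
    for name, (pre, full) in RULES.items()
}


def scan(data):
    """Return {rule_name: captures} for every rule confirmed in data."""
    results = {}
    for name, (pre, full) in COMPILED.items():
        if pre.search(data) is None:
            continue              # stage 1: cheap yes/no check, no captures
        m = full.search(data)     # stage 2: confirm and extract captures
        if m:
            results[name] = m.groups()
    return results
```

For example, `scan("GET /index.html HTTP/1.1\r\n")` yields `{"http_get": ("/index.html", "1.1")}`. In this sketch the second stage works on the complete buffer rather than on a stream of chunks.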
