Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch BitTorrent analyzer to Zeek's standard regex engine #1822

Closed
rsmmr opened this issue Oct 20, 2021 · 8 comments · Fixed by #1910
Closed

Switch BitTorrent analyzer to Zeek's standard regex engine #1822

rsmmr opened this issue Oct 20, 2021 · 8 comments · Fixed by #1910
Labels
Area: Event Engine Area: Regex Complexity: Modest A cup of tea and an evening (or two) with Zeek. good first issue A good place to get started working with Zeek. Type: Maintenance

Comments

@rsmmr
Copy link
Member

rsmmr commented Oct 20, 2021

Turns out that BitTorrentTracker.cc uses POSIX regex functions (#include <regex.h>, regcomp/regexec) instead of Zeek's own engine. We should switch that over to remove that dependency and provide consistency.

@rsmmr rsmmr added good first issue A good place to get started working with Zeek. Complexity: Modest A cup of tea and an evening (or two) with Zeek. Type: Maintenance Area: Event Engine Area: Regex labels Oct 20, 2021
@avinal
Copy link
Contributor

avinal commented Oct 20, 2021

Hello @rsmmr, I would like to work on this issue. I went through BitTorrentTracker.cc and got an idea of what needs to be done. But I couldn't find any resources on Zeek's custom regex. Can you please elaborate on that?

@rsmmr
Copy link
Member Author

rsmmr commented Oct 21, 2021

Cool, thanks. The API for Zeek's regex code is in src/RE.h, you can use the RE_Matcher class. Looking for an example in existing code, I see analyzer/protocol/login/Login.cc for example. (That code creates them from script-level variables, which in this case here we don't need: you can just hardcode the patterns when creating the RE_Matcher).

Let me know if anything else would be helpful.

@avinal
Copy link
Contributor

avinal commented Oct 21, 2021

Thanks for this. I will look into it and try to implement.

@rsmmr
Copy link
Member Author

rsmmr commented Nov 4, 2021

How is it going? Anything else we can help with?

@avinal
Copy link
Contributor

avinal commented Nov 4, 2021

Hey @rsmmr,
I had couple of interviews and some college stuffs so couldn't complete it. I have changed most of the part. There are few doubts, I will put them here. I will complete it by end of this week. Apologies for the delay.

@avinal
Copy link
Contributor

avinal commented Nov 5, 2021

So I am using MatchAnywhere() in place of regexec(). But the problem is, the first function only returns one value, denoting the character of the match, but the latter returns two values. I can't think of a way to get the length of the match too.
image

This is the particular case where I am having problems:
https://github.com/zeek/zeek/blob/master/src/analyzer/protocol/bittorrent/BitTorrentTracker.cc#L269

I went through other functions available and none of them seems fit to this. Do tell me if I missed anything.

@rsmmr
Copy link
Member Author

rsmmr commented Nov 5, 2021

Good catch, yeah, it looks like we actually don't have the right API for doing the matching in the same way. We should be ok for r_get and r_hdr because they are anchored, so MatchPrefix should work there. But for re_get_end the code needs both start and end position, which we our regex matcher indeed doesn't provide currently.

We could (and probably should) add a method that does that, but that's not quite trivial because of how the regex works internally. So for now, what I would suggest instead is to split re_get_end up into two matches: one starting from where r_get ends and then looking for the next [ \t ] from there; that will be the desired starting location. And then a second match starting there looking for the original [ \t]+HTTP/[0123456789.]+$ through MatchPrefix. (And searching for the initial [ \t ]wouldn't even need to be a regex match; could just be a for loop).

@avinal
Copy link
Contributor

avinal commented Nov 6, 2021

Ok thanks, I will try to implement this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: Event Engine Area: Regex Complexity: Modest A cup of tea and an evening (or two) with Zeek. good first issue A good place to get started working with Zeek. Type: Maintenance
Projects
None yet
2 participants