urls.zeek regular expression #540

Open
Crypto-Cat opened this issue Aug 13, 2019 · 1 comment

Comments

@Crypto-Cat

commented Aug 13, 2019

Hi,

I noticed the find_all_urls() function within urls.zeek seems to have problems correctly extracting some URLs.

I compared the regex with some failed test cases on RegExr:

url_regex = /^([a-zA-Z\-]{3,5})(:\/\/[^\/?#"'\r\n><]*)([^?#"'\r\n><]*)([^[:blank:]\r\n"'><]*|\??[^"'\r\n><]*)/

Here are some examples of strings that are being extracted from HTML/JS as URLs and which, when tested against url_regex, incorrectly produce a match:

http://www.iec.ch\x00\x
http://youngprotectors.com http://amwcomics.com
https://ws.sharethis.com/images/sprite_32.png)
http://w.sharethis.com/share4x/images/bubble_arrow_below.png) no-repeat 10px 40px;line-height:16px}.stButton .stBubbleSm{background-image:url(
http://w.sharethis.com/share4x/images/Twitter_bubble_arrow.png)}.st_twitter_vcount .stBubble_count,.st_twitter_vcount .stBubble_count{background:#fff;border:1px solid #cce3f3;filter:none}.st_twitter_button .stButton_gradient,.st_twitter_button .stButton_gradient:hover,.st_twitter_vcount .stButton_gradient,.st_twitter_vcount .stButton_gradient:hover,.st_twitter_hcount .stButton_gradient,.st_twitter_hcount .stButton_gradient:hover{border:1px solid #cce3f3;background:#fff;filter:none}.st_twitter_vcount .stBubble{background-image:url(
http://www.yaoi911.com&#xA;http://webcomics.yaoi911.com
http://pypesexhaust.com\x22\x3e\x3
http://www.vtem.net  All Rights Reserved */
http://www.example.com \xd9\x88\xd8\xa7\xd8\xb1\xd8\xaf \xda\xa9\xd9\x86\xdb\x8c\xd8\xaf.
http://theezpzway.com; Released under the MIT License

https://pastebin.com/nT7UheUz

You can paste the examples above into RegExr together with url_regex to see the issue. Let me know if you want more examples. At the very least, I think a URL should not contain spaces. There are many test cases where urls.zeek extracts multiple URLs separated by a space and classifies them as a single URL, e.g. "a.com b.com c.com", instead of extracting each one separately.

Thanks,
CryptoCat
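
To make the overmatching concrete, here is a minimal Python sketch rather than Zeek script. URL_REGEX is a hand translation of what url_regex in urls.zeek is assumed to look like (Zeek's [:blank:] class is written out as space and tab). STRICT_URL and split_urls are hypothetical names introduced only for illustration and are not a proposed patch.

```python
import re
from typing import List

# Assumed Python translation of url_regex from urls.zeek. Zeek's [:blank:]
# class is spelled out as space and tab; the "*" quantifiers and backslash
# escapes are assumed to match the original script.
URL_REGEX = re.compile(
    r"""^([a-zA-Z\-]{3,5})(://[^/?#"'\r\n><]*)([^?#"'\r\n><]*)"""
    r"""([^ \t\r\n"'><]*|\??[^"'\r\n><]*)"""
)

# Hypothetical stricter pattern (not from Zeek): a URL ends at the first
# whitespace character, quote, angle bracket, or backslash.
STRICT_URL = re.compile(r"""[a-zA-Z\-]{3,5}://[^\s"'<>\\]+""")


def split_urls(text: str) -> List[str]:
    """Return every URL-like run in text, one list entry per URL."""
    return STRICT_URL.findall(text)


sample = "http://youngprotectors.com http://amwcomics.com"

# The translated url_regex swallows both URLs, space included.
print(URL_REGEX.match(sample).group(0) == sample)

# The stricter pattern yields the two URLs separately.
print(split_urls(sample))
```

This sketch only addresses the whitespace and escape-sequence cases. Trailing punctuation such as ")" or ";" and HTML entities such as "&#xA;" would still end up inside a match and would need separate handling.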

jsiwek added this to the 3.1.0 milestone Aug 13, 2019

jsiwek added this to Unassigned / Todo in Release 3.1.0 via automation Aug 13, 2019

@jsiwek

Member

commented Aug 13, 2019

The logic in the decompose_uri function could also use improvements so that arbitrary input won't trigger scripting errors like this:

$ zeek -e 'print decompose_uri("http://example.com:")'
error in /Users/jon/pro/zeek/zeek/scripts/base/utils/urls.zeek, line 122: bad conversion to count (to_count(parts[1]) and )
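
One way to avoid this class of error is to convert the port only after checking that the text following the colon is numeric. The following is a minimal Python sketch of that check, not Zeek's decompose_uri; split_host_port is a hypothetical helper, and returning None for an empty or non-numeric port is an assumed design choice.

```python
from typing import Optional, Tuple


def split_host_port(authority: str) -> Tuple[str, Optional[int]]:
    """Split "host:port" into (host, port); port is None if absent or invalid.

    Hypothetical helper for illustration only.
    """
    # A bracketed IPv6 literal with no port, e.g. "[::1]", has no port part.
    if authority.endswith("]"):
        return authority, None

    host, sep, port_text = authority.rpartition(":")
    if not sep:
        # No colon at all: the whole string is the host.
        return authority, None

    # Convert only when the port is a non-empty run of ASCII digits, so an
    # input like "example.com:" never reaches the integer conversion.
    if port_text.isascii() and port_text.isdigit():
        return host, int(port_text)

    return host, None


print(split_host_port("example.com:"))
print(split_host_port("example.com:8080"))
```

Whether a malformed port should be dropped silently, as here, or reported to the caller is a separate decision that this sketch does not settle.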