[Enhancement] Add IPFS URL heuristic #4310
The rule itself looks good, but I'd suggest thinking about its performance impact.
rules/regexp/misc.lua
Outdated
-- characters (CIDv0), or a CIDv1 of an alphanumerical string of unspecified length,
-- depending on the hash algorithm used.
local ipfs_cid = '/(qm[a-z0-9]{44}|[a-z0-9]{45,256})/{url}i'
This regexp will perform very badly in Hyperscan (and probably in PCRE as well). I'd appreciate it if we could use something other than {45,256} here.
Thank you. I will look into this, and try to come up with a less costly regexp for parsing IPFS CIDv1s.
As requested by @vstakhov in rspamd#4310 (review), try to limit the performance impact of this regular expression. However, given that there does not seem to be a hard length limit for CIDv1s in IPFS itself, using a hashing algorithm with a large output may permit miscreants to get around this rule.
Having worked through the CIDv1 specification, the only things we can do against the performance costs of this regexp getting out of hand are:
I added commits implementing these changes. @vstakhov: What do you think of it?
@citrin has some thoughts about this rule, so he will probably comment a little later. Thank you for working on that!
Are there further changes needed from my end? I continue to observe IPFS phishing frequently, and would love to see this rule make it into rspamd to provide better detection for its users.
Given IPFS' popularity among miscreants for phishing hosting and malware dissemination, the presence of a URL containing both "ipfs" and a random string reminiscent of an IPFS content identifier is a strong sign of maliciousness (I have never seen a legitimate IPFS URL so far, certainly not in mail traffic).
Please note that while IPFS CIDv0s are easy to parse due to their fixed syntax, CIDv1s have neither a fixed length nor a static character set. To keep miscreants from bypassing this heuristic by using hashing algorithms with larger output, the CIDv1 regexp is rather fuzzy, catching anything alphanumeric between 45 and 256 characters. Most CIDv1s seen so far, however, stayed between 60 and 120 characters.
See https://docs.ipfs.tech/concepts/content-addressing/ for details on CIDs, and how to parse them.
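To make the description above concrete, here is a rough sketch of what the heuristic checks, written in Python purely for illustration. It is not the rspamd rule itself: the helper name `looks_like_ipfs_url` is hypothetical, and Python's `re` module only approximates how rspamd applies the expression to URLs via its `{url}` selector and `i` flag.

```python
import re

# Same alternation as in the proposed rule: "qm" followed by 44 alphanumerics
# (CIDv0-like), or any run of 45 to 256 alphanumerics (fuzzy CIDv1-like).
# re.IGNORECASE stands in for the trailing "i" flag of the rspamd expression.
IPFS_CID = re.compile(r"qm[a-z0-9]{44}|[a-z0-9]{45,256}", re.IGNORECASE)


def looks_like_ipfs_url(url: str) -> bool:
    """Hypothetical helper: True if the URL mentions "ipfs" and also
    contains a string that resembles an IPFS content identifier."""
    if "ipfs" not in url.lower():
        return False
    # Unanchored search, so the CID-like string may sit in the host or the path.
    return IPFS_CID.search(url) is not None


if __name__ == "__main__":
    # Synthetic inputs built by repetition, not real CIDs.
    cid_v0_like = "Qm" + "a" * 44
    cid_v1_like = "b" * 59
    print(looks_like_ipfs_url("https://ipfs.io/ipfs/" + cid_v0_like))
    print(looks_like_ipfs_url("https://" + cid_v1_like + ".ipfs.dweb.link/"))
    print(looks_like_ipfs_url("https://example.com/" + cid_v1_like))
```

Note that the second alternative already covers every CIDv0-like string of 46 alphanumerics, so the first branch mainly documents intent; whether that overlap matters for Hyperscan performance is not something this sketch can show.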