[Enhancement] Add IPFS URL heuristic #4310

Merged
merged 7 commits into from
Feb 20, 2023

twesterhever
Contributor

Given IPFS' popularity among miscreants for phishing hosting and malware dissemination, the presence of a URL containing both "ipfs" and a random string reminiscent of an IPFS content identifier is a strong sign of maliciousness (I have never seen a legitimate IPFS URL so far, certainly not in mail traffic).

Please note that while IPFS CIDv0s are easy to parse due to their fixed syntax, CIDv1s have neither a fixed length nor a fixed character set. To keep miscreants from bypassing this heuristic by switching to hashing algorithms with larger output, the CIDv1 regexp is rather fuzzy, catching anything alphanumeric between 45 and 256 characters. Most CIDv1s seen so far, however, stayed between 60 and 120 characters.

See https://docs.ipfs.tech/concepts/content-addressing/ for details on CIDs, and how to parse them.
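
For readers who want to try the heuristic outside rspamd, here is a minimal Python sketch of the description above. It transcribes the regexp quoted in this PR and adds the "ipfs" substring check. The helper name `looks_like_ipfs_url` and the example URLs are made up for illustration, and the rspamd `{url}` flag has no Python counterpart, so the URL is passed in as a plain string. This is not the rule code itself.

```python
import re

# Python transcription of the pattern quoted in this PR. The first branch is
# the fixed-length CIDv0 form, the second the fuzzy CIDv1 form.
IPFS_CID = re.compile(r"(qm[a-z0-9]{44}|[a-z0-9]{45,256})", re.IGNORECASE)


def looks_like_ipfs_url(url: str) -> bool:
    """Hypothetical helper: the URL mentions "ipfs" and contains a CID-like run."""
    return "ipfs" in url.lower() and IPFS_CID.search(url) is not None


# Synthetic placeholder, not a real CID: "Qm" followed by 44 alphanumerics.
fake_cidv0 = "Qm" + "a" * 44

print(looks_like_ipfs_url("https://ipfs.example/ipfs/" + fake_cidv0))
print(looks_like_ipfs_url("https://ipfs.example/docs"))
```

Any 46-character CIDv0 also satisfies the fuzzy second branch, so the first branch mainly documents intent.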

Member

@vstakhov vstakhov left a comment

The rule looks good itself, but I'd suggest to think about performance considerations

-- characters (CIDv0), or a CIDv1 of an alphanumerical string of unspecified length,
-- depending on the hash algorithm used.
local ipfs_cid = '/(qm[a-z0-9]{44}|[a-z0-9]{45,256})/{url}i'
Member

This regexp will perform very badly with Hyperscan (and probably PCRE as well). I'd appreciate it if we could use something other than {45,256} here.

Contributor Author

Thank you. I will look into this, and try to come up with a less costly regexp for parsing IPFS CIDv1s.

As requested by @vstakhov in rspamd#4310 (review), try to limit the performance impact of this regular expression. However, given that there does not seem to be a hard limit for CIDv1s in IPFS itself, using a hashing algorithm with a large output may permit miscreants to get around this rule.
@twesterhever
Contributor Author

The rule looks good itself, but I'd suggest to think about performance considerations

Having worked through the CIDv1 specification, the only things we can do against the performance costs of this regexp getting out of hand are:

  • Check whether a possible CIDv1 string starts with a multibase prefix
  • Limit the anticipated size of the total CIDv1 to something like 128 bytes (I have seen 110-byte CIDv1s in the wild, so anything shorter does not seem to make sense).

I added commits implementing these changes. @vstakhov: What do you think of it?
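
The two mitigations listed above could look roughly like the following Python sketch. It is an assumption-laden illustration, not the pattern from the commits in this PR. The prefix set `[bzfk]` is only a small subset of the multibase table (base32, base58btc, base16, base36), and the helper name `is_plausible_cidv1` is hypothetical. The 128-character cap comes from the comment above.

```python
import re

# Assumed, illustrative pattern: one multibase prefix character followed by
# 44 to 127 alphanumerics, i.e. 45 to 128 characters in total.
CIDV1_TIGHT = re.compile(r"[bzfk][a-z0-9]{44,127}", re.IGNORECASE)


def is_plausible_cidv1(segment: str) -> bool:
    """Hypothetical helper: the whole segment must look like a bounded CIDv1."""
    return CIDV1_TIGHT.fullmatch(segment) is not None


# Synthetic placeholders, not real CIDs.
print(is_plausible_cidv1("bafy" + "a" * 55))  # starts with "b", 59 characters
print(is_plausible_cidv1("x" * 60))           # no multibase prefix
print(is_plausible_cidv1("b" + "a" * 200))    # longer than the 128 cap
```

Requiring a fixed leading character and a smaller upper bound narrows the repeat range the regexp engine has to track. The trade-off is the one named in the commit message: a CID built from a hash with very large output would exceed the cap and slip past.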

@vstakhov
Member

vstakhov commented Nov 6, 2022

@citrin has some thoughts about this rule, so he will probably comment a little later. Thank you for working on that!

@twesterhever
Contributor Author

  • ping -

@twesterhever
Contributor Author

Are there further changes needed from my end? I continue to observe IPFS phishing frequently, and would love to see this rule making it into rspamd, to provide better detection to its users.

@vstakhov vstakhov merged commit d31dde9 into rspamd:master Feb 20, 2023