Parse URL and match domain name in filtering to avoid false positives#1275
ks129 wants to merge 1 commit into
MarkKoz
left a comment
This isn't too accurate because not all URLs posted by users will include the HTTP scheme. Unfortunately, this will cause problems with urllib even if the URL regex is adjusted:
> Following the syntax specifications in RFC 1808, urlparse recognizes a netloc only if it is properly introduced by ‘//’. Otherwise the input is presumed to be a relative URL and thus to start with a path component.
Furthermore, the netloc may include a port and/or a subdomain e.g. www.cwi.nl:80. Some filters may need to match only on specific subdomains while others will need to match all subdomains. Adding a separate filter for each possible subdomain is infeasible, so the code needs to somehow account for this.
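One possible way to handle both concerns is sketched below. It is only an illustration, not code from this PR: `extract_hostname` and `matches_domain` are hypothetical helper names, and prepending `//` to scheme-less input is an assumed workaround for the `urlparse` behaviour quoted above.

```python
import re
from urllib.parse import urlparse

# Matches a leading scheme such as "http://" or "https://".
_SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def extract_hostname(url: str) -> str:
    """Return the lowercased hostname without the port, or "" if none is found.

    Hypothetical helper: scheme-less input gets a "//" prefix so that
    urlparse treats the first component as a netloc rather than a path.
    """
    if not _SCHEME_RE.match(url) and not url.startswith("//"):
        url = "//" + url
    try:
        # .hostname drops the port and any userinfo, and lowercases the host.
        return urlparse(url).hostname or ""
    except ValueError:
        # Raised for malformed netlocs, e.g. an unbalanced IPv6 bracket.
        return ""


def matches_domain(hostname: str, blocked: str, include_subdomains: bool = True) -> bool:
    """Check a hostname against one filter entry (hypothetical helper).

    With include_subdomains=True, "www.example.com" matches "example.com",
    but "notexample.com" does not, because the match is anchored on a dot.
    """
    hostname, blocked = hostname.lower(), blocked.lower()
    if hostname == blocked:
        return True
    return include_subdomains and hostname.endswith("." + blocked)
```

Whether a given filter should match subdomains would presumably have to be stored per filter entry; the boolean flag here just stands in for that decision.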
Though I think that issue could be addressed together with this one, even if set aside, my other points still need to be addressed. Someone could easily spoof the filter by omitting the scheme or specifying a port.
Leaving this to somebody smarter 😅
Fixes #1260
Previously, the bot just used an `in` check to find blacklisted domains, but this resulted in false positives, as shown in the linked issue. Now it finds all URLs, parses them with the `urllib` parser, then extracts the domain name and checks it against the blacklist.
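The difference between the two approaches can be sketched roughly as follows. This is not the PR's actual diff: `URL_RE`, `find_blacklisted_domains`, and the sample domains are assumptions made up for illustration, and the regex deliberately requires an HTTP(S) scheme.

```python
import re
from urllib.parse import urlparse

# Assumed URL pattern for this sketch; the bot's real regex may differ.
URL_RE = re.compile(r"https?://[^\s<>]+", re.IGNORECASE)


def find_blacklisted_domains(text: str, blacklist: set[str]) -> list[str]:
    """Return blacklisted hostnames of URLs found in text (hypothetical helper)."""
    hits = []
    for url in URL_RE.findall(text):
        try:
            host = urlparse(url).hostname
        except ValueError:
            # Skip URLs that urllib cannot parse.
            continue
        # Compare the parsed hostname exactly instead of searching the raw text.
        if host and host in blacklist:
            hits.append(host)
    return hits


blacklist = {"bad.example"}
message = "docs at https://example.org/notes-on-bad.example"

# The old substring check flags the message even though the host is example.org.
substring_hit = any(domain in message for domain in blacklist)

# The parsed check only looks at the hostname, so it finds nothing here.
parsed_hits = find_blacklisted_domains(message, blacklist)
```

Here `substring_hit` is true while `parsed_hits` is empty, which is the kind of false positive the parsed check is meant to avoid.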