Skip to content

Parse URL and match domain name in filtering to avoid false positives#1275

Closed
ks129 wants to merge 1 commit into
python-discord:masterfrom
ks129:domain-match-fix
Closed

Parse URL and match domain name in filtering to avoid false positives#1275
ks129 wants to merge 1 commit into
python-discord:masterfrom
ks129:domain-match-fix

Conversation

@ks129
Copy link
Copy Markdown
Contributor

@ks129 ks129 commented Nov 7, 2020

Fixes #1260

Before bot just used in check to find blacklisted domains, but this resulted false positives like is showed in linked issue. Now this find all URLs and parse them with urllib parser, then get domain name and check it against blacklist.

@ks129 ks129 requested a review from a team as a code owner November 7, 2020 18:10
@ks129 ks129 requested review from MarkKoz and tagptroll1 and removed request for a team November 7, 2020 18:10
@ghost ghost added the needs 2 approvals label Nov 7, 2020
Copy link
Copy Markdown
Contributor

@MarkKoz MarkKoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't too accurate because not all URLs posted by users will include the HTTP scheme. Unfortunately, this will cause problems with urllib even if the URL regex is adjusted:

Following the syntax specifications in RFC 1808, urlparse recognizes a netloc only if it is properly introduced by ‘//’. Otherwise the input is presumed to be a relative URL and thus to start with a path component.

Furthermore, the netloc may include a port and/or a subdomain e.g. www.cwi.nl:80. Some filters may need to match only on specific subdomains while others will need to match all subdomains. Adding a separate filter for each possible subdomain is infeasible, so the code needs to somehow account for this.

@ghost ghost added s: waiting for author Waiting for author to address a review or respond to a comment and removed needs 2 approvals labels Nov 16, 2020
@ks129
Copy link
Copy Markdown
Contributor Author

ks129 commented Nov 21, 2020

@MarkKoz #1276 this issue is for new filter that handles subdomains too, not only exact matches.

@MarkKoz
Copy link
Copy Markdown
Contributor

MarkKoz commented Nov 24, 2020

Though I think that issue could be addressed together with this one, even if set aside, my other points still need to be addressed. Someone could easily spoof the filter by omitting the scheme or specifying a port.

@ks129
Copy link
Copy Markdown
Contributor Author

ks129 commented Dec 4, 2020

Leaving this to somebody smarter 😅

@ks129 ks129 closed this Dec 4, 2020
@Akarys42 Akarys42 reopened this Dec 4, 2020
@Akarys42 Akarys42 closed this Dec 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

s: waiting for author Waiting for author to address a review or respond to a comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Adding special logic when matching domains (bar.com matches foobar.com)

3 participants