Skip to content

feat: added url parsing to the filter#1889

Merged
ChrisLovering merged 2 commits into
python-discord:mainfrom
Kronifer:url-parsing
Dec 26, 2021
Merged

feat: added url parsing to the filter#1889
ChrisLovering merged 2 commits into
python-discord:mainfrom
Kronifer:url-parsing

Conversation

@Kronifer
Copy link
Copy Markdown
Contributor

@Kronifer Kronifer commented Oct 18, 2021

Closes #1260

Added some url parsing using urllib.parse to stop false positives. to remove false positives, I parsed the URLs and checked if the netloc was the same. If it wasn't, I passed. I also check if the netloc is prefixed with www. or any other subdomain should people try to circumvent the filter. An example false positive would be deliciouscookies.com triggering cookies.com, a blacklisted url (not linking actual example cause it's NSFW)

@Kronifer
Copy link
Copy Markdown
Contributor Author

Realized I patched docker-compose.yml for my system, will remove that

@mbaruh
Copy link
Copy Markdown
Member

mbaruh commented Oct 18, 2021

Can you clarify what false positives you found and how your code deals with them?

@Kronifer
Copy link
Copy Markdown
Contributor Author

@mbaruh to remove false positives, I parsed the URLs and checked if the netloc was the same. If it wasn't, I passed. I also check if the netloc is prefixed with www. should people try to circumvent the filter

@mbaruh
Copy link
Copy Markdown
Member

mbaruh commented Oct 18, 2021

Can you provide examples of false positives? Is there a related issue?

@Kronifer
Copy link
Copy Markdown
Contributor Author

Can you provide examples of false positives? Is there a related issue?

@mbaruh yes, I'll link the issue and provide examples in the initial description

@Kronifer
Copy link
Copy Markdown
Contributor Author

@mbaruh check the original description

Comment thread bot/exts/filters/filtering.py Outdated
Comment thread bot/exts/filters/filtering.py Outdated
Comment thread bot/exts/filters/filtering.py Outdated
Comment thread bot/exts/filters/filtering.py Outdated
Copy link
Copy Markdown
Member

@ChrisLovering ChrisLovering left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had the wrong option selected when requesting changes

Copy link
Copy Markdown
Member

@ChrisLovering ChrisLovering left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested, looks good 👍

@ChrisLovering ChrisLovering enabled auto-merge (squash) October 21, 2021 19:02
@Xithrius Xithrius added a: backend Related to internal functionality and utilities (error_handler, logging, security, utils and core) a: filters Related to message filters: (antimalware, antispam, filtering, token_remover) p: 2 - normal Normal Priority s: needs review Author is waiting for someone to review and approve labels Oct 23, 2021
@Xithrius Xithrius requested a review from kosayoda October 23, 2021 01:44
Copy link
Copy Markdown
Contributor

@D0rs4n D0rs4n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks fine overall, there's only one single thing.

Comment thread bot/exts/filters/filtering.py Outdated
Copy link
Copy Markdown
Member

@ChrisLovering ChrisLovering left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revoking my approval due to subdomains not being filtered here.

See this message and the discussion preceding it for context
https://canary.discord.com/channels/267624335836053506/635950537262759947/902203112834629692

auto-merge was automatically disabled December 3, 2021 03:04

Head branch was pushed to by a user without write access

Copy link
Copy Markdown
Contributor

@onerandomusername onerandomusername left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a heads up, poetry.lock got all of the dependencies updated.

To fix this, please be sure you are using poetry 1.1.x

First revert poetry.lock locally.

Ensure that pyproject.toml is the same, with tldextract in it.

Next, use poetry lock --no-update
This will relock poetry without updating all of the dependencies.

However, it may be worth an update seperately to the dependencies, as this updated redis, rapidfuzz, sentry, etc

Comment thread bot/exts/filters/filtering.py Outdated
Comment thread bot/exts/filters/filtering.py
@Kronifer
Copy link
Copy Markdown
Contributor Author

Kronifer commented Dec 3, 2021

As a heads up, poetry.lock got all of the dependencies updated.

To fix this, please be sure you are using poetry 1.1.x

First revert poetry.lock locally.

Ensure that pyproject.toml is the same, with tldextract in it.

Next, use poetry lock --no-update This will relock poetry without updating all of the dependencies.

However, it may be worth an update seperately to the dependencies, as this updated redis, rapidfuzz, sentry, etc

this has been deemed not to be a problem as of https://discord.com/channels/267624335836053506/635950537262759947/916168658651324487

@Kronifer
Copy link
Copy Markdown
Contributor Author

Kronifer commented Dec 9, 2021

Just a quick comment:

This PR was made to improve the URL filter by removing false positives, like delicious-cookies.com being deleted for triggering cookies.com in the blacklist. As this continued, we added support to remove subdomains from any sent URLs to prevent circumvention. For any wondering why this exists, here you go 😄

@Kronifer Kronifer removed the request for review from Akarys42 December 15, 2021 01:36
@onerandomusername
Copy link
Copy Markdown
Contributor

this has been deemed not to be a problem as of https://discord.com/channels/267624335836053506/635950537262759947/916168658651324487

After looking into this further, markdownify cannot be updated, as it will nerf the results of the doc command.

@onerandomusername
Copy link
Copy Markdown
Contributor

Fixed that change in GH-2014 👍

Copy link
Copy Markdown
Contributor

@onerandomusername onerandomusername left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

forgot to come back, approved now

Copy link
Copy Markdown
Member

@mbaruh mbaruh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great and seems to be working. Thanks!

@@ -481,7 +482,10 @@ async def _has_urls(self, text: str) -> Tuple[bool, Optional[str]]:
for match in URL_RE.finditer(text):
for url in domain_blacklist:
if url.lower() in match.group(1).lower():
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd save url.lower() and match.group(1).lower() into separate variables just because those values are used a couple of times each, but it's not a big deal here.

@HassanAbouelela HassanAbouelela mentioned this pull request Dec 23, 2021
2 tasks
@Xithrius Xithrius requested review from MrHemlock and removed request for Den4200 and kosayoda December 24, 2021 00:00
Copy link
Copy Markdown
Member

@ChrisLovering ChrisLovering left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm :D

Copy link
Copy Markdown
Contributor

@Akarys42 Akarys42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proxy approval

@ChrisLovering ChrisLovering merged commit 24d0c4e into python-discord:main Dec 26, 2021
@Xithrius Xithrius removed the s: needs review Author is waiting for someone to review and approve label Feb 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

a: backend Related to internal functionality and utilities (error_handler, logging, security, utils and core) a: filters Related to message filters: (antimalware, antispam, filtering, token_remover) p: 2 - normal Normal Priority

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Adding special logic when matching domains (bar.com matches foobar.com)

7 participants