feat: added url parsing to the filter #1889
Conversation
Realized I patched docker-compose.yml for my system, will remove that
Can you clarify what false positives you found and how your code deals with them?
Can you provide examples of false positives? Is there a related issue?
@mbaruh yes, I'll link the issue and provide examples in the initial description |
@mbaruh check the original description |
ChrisLovering
left a comment
Had the wrong option selected when requesting changes
D0rs4n
left a comment
It looks fine overall, there's only one single thing.
ChrisLovering
left a comment
Revoking my approval due to subdomains not being filtered here.
See this message and the discussion preceding it for context
https://canary.discord.com/channels/267624335836053506/635950537262759947/902203112834629692
Head branch was pushed to by a user without write access
onerandomusername
left a comment
As a heads up, poetry.lock got all of the dependencies updated.
To fix this, please be sure you are using poetry 1.1.x
First revert poetry.lock locally.
Ensure that pyproject.toml is the same, with tldextract in it.
Next, use poetry lock --no-update
This will relock poetry without updating all of the dependencies.
However, it may be worth updating the dependencies separately, as this updated redis, rapidfuzz, sentry, etc.
This has been deemed not to be a problem as of https://discord.com/channels/267624335836053506/635950537262759947/916168658651324487
Just a quick comment: This PR was made to improve the URL filter by removing false positives, like delicious-cookies.com being deleted for triggering cookies.com in the blacklist. As this continued, we added support for removing subdomains from any sent URLs to prevent circumvention. For anyone wondering why this exists, here you go 😄
After looking into this further, markdownify cannot be updated, as it will nerf the results of the doc command.
Fixed that change in GH-2014 👍
onerandomusername
left a comment
forgot to come back, approved now
mbaruh
left a comment
Looks great and seems to be working. Thanks!
@@ -481,7 +482,10 @@ async def _has_urls(self, text: str) -> Tuple[bool, Optional[str]]:
        for match in URL_RE.finditer(text):
            for url in domain_blacklist:
                if url.lower() in match.group(1).lower():
I'd save url.lower() and match.group(1).lower() into separate variables just because those values are used a couple of times each, but it's not a big deal here.
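For reference, the suggestion above could look roughly like the sketch below. It is not the code from this PR: `has_urls` is a synchronous stand-in for `_has_urls`, the `URL_RE` pattern is a hypothetical placeholder (the bot's real pattern is not shown in this thread), and the returned value is an assumption based only on the signature in the diff.

```python
import re
from typing import Iterable, Optional, Tuple

# Hypothetical stand-in for the bot's URL_RE; the real pattern is not shown here.
URL_RE = re.compile(r"(https?://[^\s<]+)", re.IGNORECASE)


def has_urls(text: str, domain_blacklist: Iterable[str]) -> Tuple[bool, Optional[str]]:
    """Return (True, entry) for the first blacklist entry found inside a URL in text."""
    # Lowercase each blacklist entry once instead of once per match.
    lowered_blacklist = [url.lower() for url in domain_blacklist]
    for match in URL_RE.finditer(text):
        # Lowercase the matched URL once and reuse it for every comparison.
        matched_url = match.group(1).lower()
        for blacklisted in lowered_blacklist:
            if blacklisted in matched_url:
                return True, blacklisted
    return False, None
```

This keeps the same substring check as the excerpt, so it only shows the variable hoisting; it does not address the false positives by itself.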
Closes #1260
Added URL parsing using urllib.parse to stop false positives. To remove them, I parse each URL and check whether its netloc is the same as the blacklisted entry; if it isn't, the URL is not flagged. I also check whether the netloc is prefixed with www. or any other subdomain, in case people try to circumvent the filter. An example false positive would be deliciouscookies.com triggering cookies.com, a blacklisted URL (not linking an actual example because it's NSFW).
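A minimal sketch of the check described above, using only the standard library. The helper name `is_blacklisted` is hypothetical, and the sketch reads `hostname` rather than the raw `netloc` so that ports and credentials are ignored; the merged code may differ (the review thread mentions tldextract being added to pyproject.toml).

```python
from urllib.parse import urlparse


def is_blacklisted(url: str, blacklisted_domain: str) -> bool:
    """Hypothetical helper: True if url's host is the blacklisted domain or a subdomain of it."""
    # urlparse only fills in the network location when a scheme or a leading // is present.
    if "://" not in url:
        url = "//" + url
    host = (urlparse(url).hostname or "").lower()
    blacklisted = blacklisted_domain.lower()
    # Match the domain itself or any subdomain such as www.cookies.com.
    # The leading dot keeps deliciouscookies.com from matching cookies.com.
    return host == blacklisted or host.endswith("." + blacklisted)
```

With this comparison, `https://www.cookies.com` is flagged for the entry `cookies.com`, while `https://deliciouscookies.com` is not.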