FilterLists: Manage whitelisting and blacklisting via the bot #1058

lemonsaurus · 2020-07-19T19:45:44Z

Up until now, we have managed various whitelists and blacklists via adding constants to our default-config.yml file. However, it has often been suggested that we should instead be managing these via a bot command, and storing them in the database.

Related issues:

NOTE: This pull request must be merged after python-discord/site#305!

This is a big pull request, best reviewed commit-by-commit, which makes a number of changes to filtering, antimalware, constants, and implements a new cog for managing blacklists and whitelists.

Cache blacklists and whitelists, to minimize API requests.

We don't want to be making API request for every single message, but we need to scan all messages with our filters and see if they contain blacklisted words, or server invites. The solution is caching. This pull requests makes the bot cache all the AllowDenyList data from the site when the bot starts, and then updates this cache whenever the data changes.

Changes to the exception handler

Currently, we're not returning the error message whenever a BadArgument is raised from a command. This was done because some BadArguments would contain the input data, and we would then post that input data directly into the response from the bot when the exception was raised. This allowed users to exploit the !raw command with something like !raw @Developers in order to have the bot ping 65 000 users in our guild and create absolute chaos.

So, to prevent that, we just made BadArgument not return the exception message at all. But this isn't a great solution, because the exception message often contains valuable information that our users need in order to figure out how to pass a valid argument. For the cog I was making for the whitelist and blacklist, I felt I needed to provide this kind of information to the users, particularly to tell them which kinds of blacklists and whitelists we support.

This PR changes the exception handler so that all types of exceptions handled there will return an embed containing the full exception message. This strikes a nice balance between usability and security, since pings do not work from inside embeds.

New cog: FilterLists

The main attraction of this pull request is of course the new FilterLists cog. This cog allows us to add or remove items in the whitelist or the blacklist, and to show all the items that are currently there.

Adding file formats:

Adding guild invites:

Removing stuff:

Listing guild invites:

Listing file formats:

Changes to AntiMalWare and Filtering

These two existing cogs have gotten some kaizen, and have been refactored to utilize the blacklists and whitelists in the cache instead of relying on constants.

The constants they used before have been removed.

We no longer differentiate between word_watchlist and token_watchlist - if you need word boundaries before and after your regex, you just add a \b. This new system, simply called filter_tokens, allows you then to input any regex you want with no implicit magic.

New converters

In order to validate list types, I've created a converter called ValidAllowDenyListType, which will make an API request to figure out exactly which kinds of blacklists and whitelists we support. This means that the available blacklists and whitelists is managed in realtime from a single location - an enum inside the model definition on the site. This is the DRYest way I could think of to do this.

In order to validate guild invites, I've created a converter called ValidDiscordServerInvite, which does the following:

Checks the string against a compiled regex pattern which contains all legal variations of guild invites.
If it gets a match, it uses the Discord API to look up the invite and get information about the guild. The converter then returns this information as a dictionary, including stuff like the guild name. We always attach the guild name as a comment when adding a new guild invite to the blacklist or whitelist, so that you can see which Guild ID's belong to which guilds.
If the string contains a Guild ID, it raises a BadArgument saying Guild ID's can't be resolved.
Otherwise, it raises a BadArgument saying the invite is invalid.

New guild filtering logic

As discussed in https://github.com/python-discord/organisation/issues/261, we wanted a slightly more nuanced approach to filtering guild invites. This PR makes these changes, so that the following is true:

A blacklisted guild invite will always trigger an alert
A whitelisted guild invite will never trigger an alert
Partnered and Verified guild invites only trigger an alert if they are blacklisted.
All other guild invites will trigger an alert.

We shouldn't be making an API call for every single message posted, so what we're gonna do is cache the data in the Bot, and then update the cache whenever we make changes to it via our new AllowDenyList cog. Since this cog will be the only way to make changes to this, this level of lazy caching should be enough to always keep the cache up to date.

We'll use this to ensure the input is valid when people try to whitelist or blacklist stuff. It will fetch its data from an Enum maintained on the site, so that the types of lists we support will only need to be maintained in a single place, instead of duplicating that data in the bot and the site.

Currently, some types of errors are returning plain strings that repeat the input (which can be exploited to deliver stuff like mentions), and others are returning generic messages that don't give any exception information. This commit unifies our approach around putting as much information as we can (including the exception message), but always putting it inside an embed, so that stuff like pings will not fire. This, combined with the 1.4.0a `allowed_mentions` functionality, seems like a reasonable compromise between security and usability.

Instead of just dumping the JSON response from the site, we'll build a data structure that it will be convenient to access from our new cog, and from the Filtering cog.

This includes commands to add, remove and show the items in the whitelists and blacklists for the different list types. Commands are limited to Moderators+.

Instead of fetching the guild invite IDs from config-default.yml, we will now be using the AllowDenyList cache to check these.

Also updates the tests for this cog.

This gives easier access to the Guild ID in the place where you're most likely to want to use the whitelist command.

We will now validate and convert any standard discord server invite to a guild ID, and automatically add the name of the server as a comment. This will ensure that the list of whitelisted guild IDs will be readable and nice. This also makes minor changes to list output aesthetics.

We want to support deletion of both IDs and guild invites, so we need a bit of special handling for that.

We now filter guild invites the following way: - Whitelisted invites are always permitted. - Blacklisted invites are never permitted. - If the invite is not blacklisted, it is permitted only if it is a Verified or a Partnered server, otherwise not. This strategy was decided on during the June 7th staff meeting, see https://github.com/python-discord/organisation/issues/261

bot/utils/regex.py

bot/converters.py

bot/cogs/error_handler.py

bot/converters.py

bot/cogs/allow_deny_lists.py

bot/cogs/filtering.py

bot/cogs/allow_deny_lists.py

bot/cogs/filtering.py

bot/cogs/allow_deny_lists.py

bot/bot.py

MarkKoz · 2020-07-19T23:27:56Z

bot/cogs/antimalware.py

@@ -38,6 +38,16 @@ class AntiMalware(Cog):
    def __init__(self, bot: Bot):
        self.bot = bot

+    def _get_whitelisted_file_formats(self) -> list:
+        """Get the file formats currently on the whitelist."""
+        return [item.get('content') for item in self.bot.allow_deny_list_cache['file_format.True']]


It's slightly unfortunate you have to use a comprehension every time you need a list of contents. Can the data be restructured to avoid this? In practice, you only need the content and the comments, so caching them in a dict with the former as keys and latter as values may work. Admittedly, it does sound a bit strange, but the contents are guaranteed to be unique already.

Hmm, you've got a point. I mean, this is cheap since this list will always be really really short, but we are running this list comprehension for every single message that contains an attachment and isn't sent by a staff member.

I suppose caching may be possible here.

By "caching", I meant modifying the structure of the cache you've implemented rather than creating a separate cache.

Sorry, I have no idea how you propose we do that.

This means we can validate invites that start with https://, whereas before we could not.

MarkKoz

This works really well. I really only have one requested change.

Do you think explicit feedback if a command fails would be useful? One can only assume if it lacks a reaction.
Having to include the dot for file extensions may trip people up, but I think changing the anti-malware code to support the lack of a dot would be annoying.
I wish there was a way to show which list types are supported in the help text for commands. Right now, the only way to see that is to put some bad list type and rely on the converter's error message.
I really dislike how big the error messages are now. I think it needs to stop displaying the help command if it's going to be using an embed for the error. Alternatively, somehow merge the help command and error message into one embed (sounds complicated and out of scope). If help isn't displayed, to figure out a command, users would have to 1. know the help command exists 2. move to a different channel to use it (unless they're staff).

bot/cogs/filter_lists.py

We want the !help invocations to give you all the information you need in order to use the command. That also means we need to provide the valid filterlist types, which are subject to change. So, we fetch the valid ones from the API and then dynamically insert them into the docstrings.

For deleting and listing data, we now get some more feedback when things fail.

Den4200

Everything looks good and works well! Just a few minor suggestions.

bot/__main__.py

bot/cogs/filter_lists.py

kwzrd

Only stylistic nit-picks from an initial look ~ I'll be able to do a more thorough review later in the week if still necessary. But it's looking good.

bot/cogs/filter_lists.py

@Den4200

Thanks @Den4200!

Leon Sandøy added 17 commits July 17, 2020 21:30

Kaizen: Move OffTopicName to converters.py.

db79b6a

Better BadArgument exception text.

ccc2e7a

Bulletlist with valid file types in converter.

7d7fdd7

Make the cache more convenient to access.

b1311ea

Instead of just dumping the JSON response from the site, we'll build a data structure that it will be convenient to access from our new cog, and from the Filtering cog.

Add AllowDenyLists cog.

4d1b6a3

This includes commands to add, remove and show the items in the whitelists and blacklists for the different list types. Commands are limited to Moderators+.

Add more logging to AllowDenyLists cog.

2228b42

Remove Filtering constants, use cache data.

d07b1af

Instead of fetching the guild invite IDs from config-default.yml, we will now be using the AllowDenyList cache to check these.

Remove AntiMalWare constants, use cache data.

1c569f2

Also updates the tests for this cog.

Fix an absolutely terrible comment.

0d51d35

Include the guild ID in mod-log embed.

da26036

This gives easier access to the Guild ID in the place where you're most likely to want to use the whitelist command.

Support the new AllowDenyList field, 'comment'.

064130f

Validation of guild invites for delete.

f771222

We want to support deletion of both IDs and guild invites, so we need a bit of special handling for that.

lemonsaurus requested a review from a team as a code owner July 19, 2020 19:45

lemonsaurus requested review from eivl and MrHemlock and removed request for a team July 19, 2020 19:45

ghost added the needs 2 approvals label Jul 19, 2020