Skip to content

Conversation

@mbaruh
Copy link
Member

@mbaruh mbaruh commented Aug 21, 2021

Closes #622, closes #665, closes #1522

This PR gives the AntiSpam cog the ability to detect spam of users across channels, and even when their messages (or part of them) were deleted, either by them or automatically because of another filter.

Implementation

The feature was implemented by making the on_message listener search for messages in a cache, rather than the history of the channel the message was sent in.

The MessageCache class

The cache used is an instance of a class added in this PR, rather than the one provided by discord.py. This has several advantages:

  • The collection of messages to search in can be defined to be significantly smaller, as the cog only cares about the messages sent in the last few seconds.
  • discord.py removes deleted messages from the cache, while the new cache implementation does not. This allows easily handling deleted messages in the anti-spam cog.
  • Because we don't remove elements from the middle of the sequence, the implementation of the cache could be made such that appending and popping from either side of it, as well as lookup by index and by message ID, can be implemented to run in constant time.

Not all features of the MessageCache (particularly slicing) are currently needed, but I preferred making a complete feature that didn't require too many future alterations, as we were recently discussing doing more things with it.

Changes to DeletionContext

As a spam event is no longer defined by a single channel, I couldn't use it anymore to identify a deletion context in the message_deletion_queue. I chose to identify a spam event by the users causing it instead, allowing additional channels to be added to the deletion context, instead of allowing adding additional members as was before.

This can run into the edge-case of one member becoming irrelevant to the filter while others still are, resulting in another log message being sent, but it's an unlikely edge case since the users should be muted almost immediately, and we're currently not using any multi-member filters in the first place. If new members are causing spam in the same channels in the period of time between the creation of the deletion context, and the alerting, alerting for the new members separately seems acceptable.

mbaruh and others added 5 commits August 17, 2021 17:58
The on_message event calculated the max interval value every time for no reason. The value is constant throughout the bot's up time.
The anti-spam cog now uses a cache instead of reading channel history.
The cache is for all channels in the guild, and does not remove deleted messages.
That means that the anti-spam logic now works cross-channel and counts deleted messages.
The size of the cache is determined via a new field in the config YAML file.

The cache was implemented as a separate class, MessageCache, which uses circular buffer logic.
This allows for constant time addition and removal form either side, and lookup. The cache does not support removal from the middle of the cache.
The cache additionally stores a mapping from message ID's to the index of the message in the cache, to allow constant time lookup by message ID.

The commit additionally adds accompanying tests, and renames `cache.py` to `caching.py` to better distinguish it from the new `message_cache.py` and convey that it's for general caching utilities.
The anti-spam cog was amended to handle cross-channel spam.
Since the anti-spam now works cross-channels, it makes no sense to identify it by the channel in which it was invoked.
The DeletionContext class was changed to accept a frozenset of members, and the message_deletion_queue dict uses the frozensets as keys.
DeletionContext still accepts a channel on creation, because while it might get added more channels, there's only one channel in which the mute message will be sent.

Using members as the key can run into the issue of one member becoming irrelevant to the filter while others still are, resulting in another log message being sent, but it's an unlikely edge case since the users should be muted almost immediately, and we're currently not using any multi-member filters in the first place.
@mbaruh mbaruh added a: filters Related to message filters: (antimalware, antispam, filtering, token_remover) p: 1 - high High Priority labels Aug 21, 2021
Copy link
Contributor

@Akarys42 Akarys42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code is very well written, and the feature works perfectly well. Thanks a lot, Zig, favourite zebra ever!

Removed unused import, corrected docstring, and removed unnedded type annotation.
mbaruh and others added 2 commits August 21, 2021 20:52
getitem based iteration included operations that aren't necessary when iterating over the cache continuously. Adding an iter method to the class seems to have improved iteration speed by several orders of magnitude.
@Xithrius Xithrius added the t: feature New feature or request label Aug 23, 2021
Comment on lines +249 to +253
channel_messages = defaultdict(list)
for message in messages:
channel_messages[message.channel].append(message)
for channel, messages in channel_messages.items():
await channel.delete_messages(messages)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I understand this, you go through each message and add them to the list in the dictionary?

Then you go through this dictionary and delete those messages in bulk?

Why is the channels a default dict? I thought we already had all channels because of the set?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm grouping the messages per channel to be able to use the delete_messages method on all messages of that channel at the same time.

"""Add the received message to the end of the cache."""
if self._is_full():
del self._message_id_mapping[self._messages[self._start].id]
self._start = (self._start + 1) % self.maxlen
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps _start could be a property?

Copy link
Member Author

@mbaruh mbaruh Aug 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the benefit?

index = self._message_id_mapping.get(message_id, None)
return self._messages[index] if index is not None else None

def update(self, message: Message) -> bool:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is this used? This should be a reference to the same Message object, so it will be changed "automatically".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used in the antispam's on_message_edit. The listener receives two different Message objects.


def __iter__(self) -> t.Iterator[Message]:
if self._is_empty():
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we not need to return an empty iterator?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iteration stops when you return. There's no difference between having something to iterate on first and then return, and having nothing to iterate on and then return.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah interesting, doesn't type checkers complain though?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing that I've seen so far. Once you have a yield statement the function returns a generator which you use to get the values, and a generator is an iterator.

Copy link
Contributor

@wookie184 wookie184 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked through the MessageCache implementation too closely, but the rest looks great and seems to work perfectly.

Copy link
Member

@Bluenix2 Bluenix2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you 🦓. Looks good!

@mbaruh mbaruh merged commit 697c0da into main Aug 23, 2021
@mbaruh mbaruh deleted the mbaruh/anti-spam branch August 23, 2021 18:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

a: filters Related to message filters: (antimalware, antispam, filtering, token_remover) p: 1 - high High Priority t: feature New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Keep deleted messages in cache Support Anti-Spam Checks Across Channels Utilize a Message Cache for the Antispam Cog

7 participants