Skip to content

Commands for convenient regex testing#952

Closed
decorator-factory wants to merge 14 commits into
python-discord:masterfrom
decorator-factory:decorator-factory-add-re-command
Closed

Commands for convenient regex testing#952
decorator-factory wants to merge 14 commits into
python-discord:masterfrom
decorator-factory:decorator-factory-add-re-command

Conversation

@decorator-factory
Copy link
Copy Markdown
Contributor

  • create RegularExpressions cog
  • create !regexp search command

@decorator-factory decorator-factory requested a review from a team as a code owner May 19, 2020 02:29
@decorator-factory decorator-factory requested review from Den4200 and MrHemlock and removed request for a team May 19, 2020 02:29
@decorator-factory decorator-factory linked an issue May 19, 2020 that may be closed by this pull request
Comment thread bot/cogs/regular_expressions.py Outdated

@group(name='regexp', aliases=('regex', 're'), invoke_without_command=True)
async def regexp_group(self, ctx: Context) -> None:
"""Commands for exploring the misterious worldof regular expressions."""
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor typos here.

Suggested change
"""Commands for exploring the misterious worldof regular expressions."""
"""Commands for exploring the mysterious world of regular expressions."""

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll fix that. I think format_* docstrings can be improved as well.

@jb3
Copy link
Copy Markdown
Member

jb3 commented May 20, 2020

I'll take a more in depth look at the code soon, but one question I have now is whether executing arbitrary RegExp code is safe?

We used to have an issue on the bot where the RegExp used for code block detection resulted in the bot locking up due something called catastrophic backtracing. How can we prevent unsafe regular expressions from locking up the bot?

@jb3
Copy link
Copy Markdown
Member

jb3 commented May 20, 2020

Found the conversation for when we discovered the issue in the bot, the solution was to adjust the regular expression to something safe but this still leaves the risk of abuse if people use dangerous regular expressions.

@decorator-factory
Copy link
Copy Markdown
Contributor Author

@Jos-B Why not set a small time limit?

@jb3
Copy link
Copy Markdown
Member

jb3 commented May 20, 2020

Right, but then we need to isolate the process which would require migrating a regex executor to another platform (which with the new UNIX commands in snekbox and possibility for more things in future is feasible) which would make this implementation obsolete.

@TeamSpen210
Copy link
Copy Markdown

Could they use snekbox somehow? Format the regex into a template snippet which executes it, then prints to stdout a JSON result. Then parse that in the bot, handling errors.

@jb3
Copy link
Copy Markdown
Member

jb3 commented May 20, 2020

@jb3
Copy link
Copy Markdown
Member

jb3 commented May 20, 2020

Snekbox could definitely be used, either by adding a new executor or by basically sending all the processing code as a regular eval to snekbox.

@jb3
Copy link
Copy Markdown
Member

jb3 commented May 20, 2020

However at that point I think it becomes something where we need to discuss things among the core developers before we can think about implementation detail.

@decorator-factory
Copy link
Copy Markdown
Contributor Author

decorator-factory commented May 20, 2020

The third-party regex module seems to handle such regular expressions without any issues. All examples of catastrophic regexes I could find work instantly.
The stackoverflow example:
image

@decorator-factory
Copy link
Copy Markdown
Contributor Author

A possible issue with regex is that it adds some extra regex features, so a regular expression might work with the command, but not in python with re, which might confuse people. A simple fix would be to parse the regex using re as well and see if it throws an error.

decorator-factory and others added 6 commits May 20, 2020 04:31
Turns out that bumping the flake8 version up to 3.8 introduces a long
list of new linting errors. Since this PR is the one that bumps the
version, I suppose we will also fix all the linting errors in this
branch.

(cherry picked from commit e993566)
Copy link
Copy Markdown
Contributor

@ks129 ks129 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found one bug. One other thing I want to say is when I tested it, all results (errors and matches) is so plain. I think some language formatting should be added to there. This will increase readability.
Screenshot 2020-05-21 at 08 54 27 for errors I made this with diff formatting and added - before each line, then I got the red text. And maybe should success result inside embed that implements wait_for_deletion?

@group(name='regexp', aliases=('regex', 're'), invoke_without_command=True)
async def regexp_group(self, ctx: Context) -> None:
"""Commands for exploring the misterious world of regular expressions."""
await ctx.invoke(self.bot.get_command("help"), "regexp")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not work. We migrated to new help command lately, so this has to be changed:

Suggested change
await ctx.invoke(self.bot.get_command("help"), "regexp")
await ctx.send_help(ctx.command)

@MarkKoz MarkKoz added a: utility Related to utility commands: (bot, eval, extensions, jams, reminders, snekbox, utils) p: 3 - low Low Priority labels May 22, 2020
@MarkKoz MarkKoz added status: needs review t: feature New feature or request labels May 22, 2020
Copy link
Copy Markdown
Contributor

@MarkKoz MarkKoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do like this feature and want to see it added. If the regex module handles catastrophic expressions well, but re doesn't, then re support needs to be dropped. I am not too concerned about people being mislead by the extended features of regex. However, if possible, using re to check for successful compilation is a good idea. On the other hand, if re can also handle catastrophic expression, then the regex module is redundant.

Regarding execution time, let's keep also keep in mind that users will only be able to search at most ~2000 characters of text.

Comment thread bot/cogs/regular_expressions.py Outdated
"""Commands for exploring the misterious world of regular expressions."""
await ctx.invoke(self.bot.get_command("help"), "regexp")

@regexp_group.command(name='search', aliases=('find', 's', '🔍'))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the use of emojis to invoke commands. Emojis are general-purpose, so someone could unintentionally invoke this command.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, if re can also handle catastrophic expression, then the regex module is redundant.
re easily freezes the interpreter if you run the example linked by @J03B
So I'm going to use re just for compatibility validation.

Comment thread bot/cogs/regular_expressions.py Outdated
@regexp_group.command(name='search+', aliases=('find+', 's+', '🔍+'))
async def match_plus_command(self, ctx: Context, pattern: RegexRegex, test: str) -> None:
"""
Like `!re search`, but with an extended regex format.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bad user experience to make them have to invoke the help command again to look at the docstring for !re search. Just copy the docstring over to this one.

Comment thread bot/cogs/regular_expressions.py Outdated

@group(name='regexp', aliases=('regex', 're'), invoke_without_command=True)
async def regexp_group(self, ctx: Context) -> None:
"""Commands for exploring the misterious world of regular expressions."""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Commands for exploring the misterious world of regular expressions."""
"""Commands for exploring the mysterious world of regular expressions."""

Comment thread bot/cogs/regular_expressions.py Outdated
Comment on lines +93 to +94
ReRegex = ConvertRegex(supports_extended_features=False)
RegexRegex = ConvertRegex(supports_extended_features=True)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think better names would be Regex and ExtendedRegex.

Comment on lines +30 to +40
r"""
Format a match result to display in a response.

>>> format_match(re.search("(\d\d)+", "hello123456world"))
[' hello123456world', ' 0: ^^^^^^', ' 1: ^^']

Which will look as:
| hello123456world
| 0: ^^^^^^
| 1: ^^
"""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about showing the actual text matched instead of the carets? That may end up being easier to read, especially if there are a lot of groups (eyes won't have to travel far to match the carets to the original string).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be better:

    |    hello123456world
    | 0:      123456
    | 1:          56

just showing the text might be ambiguous if a substring appears multiple times in a string.

However, we still have to figure out how to display zero-width matches 🤔

Comment thread bot/cogs/regular_expressions.py Outdated
await ctx.send(match_and_format(pattern, test))

@regexp_group.command(name='search+', aliases=('find+', 's+', '🔍+'))
async def match_plus_command(self, ctx: Context, pattern: RegexRegex, test: str) -> None:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test is not a clear name for the parameter. Perhaps simply named it string?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This parameter should also be a kw-only arg so it "consumes all" if the string has spaces. This means the pattern needs to be in quotes if it has spaces, but I think this is the best option we have.

Copy link
Copy Markdown
Contributor Author

@decorator-factory decorator-factory Jun 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about that. Is there any way to put the pattern or the string inside a code block?

/re search `\d+` 123456helloworld

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure. I think discord.py does not consider backticks to be the same as quotes, which means you will need to parse all arguments yourself from a single string to accomplish this. But don't take my word for it, double check this.

@jb3
Copy link
Copy Markdown
Member

jb3 commented Jun 3, 2020

I've been investigating regex a bit more and have found it is still very much susceptible to DoS attacks and should not be used with arbitrary user input.

I'll attach an example:

>>> import regex
>>> regex.match("^(([a-z])+.)+[A-Z]([a-z])+$", "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!")
... hang ...

@decorator-factory
Copy link
Copy Markdown
Contributor Author

decorator-factory commented Jun 3, 2020

I've been investigating regex a bit more and have found it is still very much susceptible to DoS attacks and should not be used with arbitrary user input.

I'll attach an example:

>>> import regex
>>> regex.match("^(([a-z])+.)+[A-Z]([a-z])+$", "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!")
... hang ...
>>> regex.match("^(([a-z])+.)+[A-Z]([a-z])+$", "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!",
... timeout=0.05)
>>> TimeoutError: regex timed out

@jb3
Copy link
Copy Markdown
Member

jb3 commented Jun 3, 2020

Huh!

@decorator-factory
Copy link
Copy Markdown
Contributor Author

Good point, though, I'll add that as well.

@nnsee
Copy link
Copy Markdown
Contributor

nnsee commented Jun 3, 2020

Alternatively, we can use snekbox for execution, leaving DoS and other similar attacks out of the equation. Would require some effort to move the code there, though.

EDIT: I see this has been discussed already, my bad.

@decorator-factory
Copy link
Copy Markdown
Contributor Author

Now it looks like this:
image

Comment thread bot/cogs/regular_expressions.py Outdated
await ctx.invoke(self.bot.get_command("help"), "regexp")

@regexp_group.command(name='search', aliases=('find', 's'))
async def match_command(self, ctx: Context, pattern: Regex, test: str) -> None:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
async def match_command(self, ctx: Context, pattern: Regex, test: str) -> None:
async def match_command(self, ctx: Context, pattern: Regex, *, test: str) -> None:

Comment thread bot/cogs/regular_expressions.py Outdated
await ctx.send(match_and_format(pattern, test))

@regexp_group.command(name='search+', aliases=('find+', 's+'))
async def match_plus_command(self, ctx: Context, pattern: ExtendedRegex, test: str) -> None:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
async def match_plus_command(self, ctx: Context, pattern: ExtendedRegex, test: str) -> None:
async def match_plus_command(self, ctx: Context, pattern: ExtendedRegex, *, test: str) -> None:

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, added now

@decorator-factory
Copy link
Copy Markdown
Contributor Author

decorator-factory commented Jun 3, 2020

TODO:

  • ⚠️ move the matching/searching to snekbox
  • somehow show zero-width groups (right now they just don't show up)
  • support named groups (right now they show up as anonymous groups)
    image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

a: utility Related to utility commands: (bot, eval, extensions, jams, reminders, snekbox, utils) p: 3 - low Low Priority t: feature New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Create !regexp command

6 participants