Skip to content

Python: Improve Regex flag parsing #15345

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

Marcono1234
Copy link
Contributor

Similar to #15244, except for the Python Regex module

Fixes:

  • Flag a not being recognized
  • Syntax for disabling flags (-) not being recognized
  • Non-capturing group with flags erroneously containing : as literal

Any feedback is appreciated! Especially if I overlooked or failed to consider something, because I am not very familiar with the Python Regex library.

Without these changes parsing of inline flags fails for a and - and erroneously treats the remaining characters incorrectly, e.g. matching them literally. See the added code examples in regex/test.py for which parsing fails without these changes.

Copy link
Contributor

@yoff yoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contribution. While this PR does add some valuable modelling, I am not comfortable with this change as it is.

If we look at the syntax for Python regular expressions, we see that there are two groups that can set modes:

  • (?aiLmsux), which changes the mode for the entire regular expression.
  • (?aiLmsux-imsx:...), which changes the mode only inside the group.

The current modelling only recognizes the first form, avoiding the complexity of local mode switching. With the modifications in this PR, particularly the predicate flag, local mode switches would affect the entire regular expression. Supporting local mode switching might be complicated and require changes to the libraries underpinning the regex queries, such as the ReDOS libraries.

Still this PR contains contributions that we could adopt right away:

  • we ignore the flag a which this PR adds support for. A PR just adding support for that would be easy to accept.
  • we currently do not support the second form very well: If it contains - it does not parse and if not, the contents get an extra : at the front. We could make it always parse and have the right content, but ignore the local modes. To do this, I would suggest treating it as a separate group via separate predicates, say local_flag_group_start and local_flag_group.

@Marcono1234
Copy link
Contributor Author

Thanks for the feedback!

For Java the Regex engine supports using groups at any location which turn on or off flags for the remainder of the Regex, for example a(?i)a. So I think with my changes for the Java CodeQL library the behavior there is still equally good / bad regarding mode handling. But please correct me if I am wrong.

But you are right that for Python the situation is different because the variant without : can only be used at the start of the Regex pattern. I will try to create a separate PR only for the a flag, and will then see if I also create a PR for the other changes.

@Marcono1234
Copy link
Contributor Author

Marcono1234 commented Jan 30, 2024

Support for the a (ASCII) flag was added in #15390. At the moment I am not planning to create a separate pull request to improve handling of groups with flags, since that might require some restructuring of the Regex library and some logic changes. I will therefore close this PR.

Thanks nonetheless for your feedback!

@Marcono1234 Marcono1234 deleted the marcono1234/python-regex-flags branch January 30, 2024 21:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants