Preparation for advanced set syntax in regular expressions #74534
Comments
Currently the re module supports only simple sets. They can include literal characters, character ranges, and some simple character classes, and they support negation. The Unicode standard [1] defines set operations (union, intersection, difference and symmetric difference) and nested sets. Some regular expression engines implement these features; for example, the regex module supports all TR18 features except not-nested POSIX character classes.

If we replace the re module with the regex module, or add support for these features to the re module and enable this syntax by default, this will break some code. It is very unlikely that a regular expression contains doubled characters ('--', '||', '&&' or '~~'), but nested sets use just '[', and unescaped '[' does occur in character sets in regular expressions (even the stdlib contains several occurrences).

The proposed patch adds FutureWarnings that are emitted when a possibly breaking set construct ('--', '||', '&&', '~~' or '[') occurs in a regular expression. We need one or two releases with a warning before changing the syntax. The patch also makes re.escape() escape '&' and '~' and fixes several regular expressions in the stdlib.

Alternatively, support for the new set syntax could be enabled by a special flag.

I'm not sure that support for set operations and nested sets is necessary. It complicates the syntax of regular expressions (which is already not simple). Currently set operations can be emulated with lookarounds:

[set1||set2] -- (?:[set1]|[set2])

[1] http://unicode.org/reports/tr18/#Subtraction_and_Intersection
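As a rough illustration of the emulation listed above, the following sketch compares a union written as an alternation of two sets with the equivalent single set, and shows what re.escape() returns for '&' and '~'. The concrete sets ([a-c] and [x-z]) and the variable names are made up for the example; they do not come from the patch.

```python
import re

# Union of two sets, emulated with an alternation as in
# [set1||set2] -- (?:[set1]|[set2]).
union_by_alternation = re.compile(r"(?:[a-c]|[x-z])")

# The same union written by hand as one simple set.
single_set = re.compile(r"[a-cx-z]")

sample = "abcdwxyz-&~"
alternation_hits = [ch for ch in sample if union_by_alternation.fullmatch(ch)]
single_set_hits = [ch for ch in sample if single_set.fullmatch(ch)]
print(alternation_hits == single_set_hits)  # True

# re.escape() escapes '&' and '~', so escaped text never contains
# a bare '&&' or '~~' pair.
print(re.escape("a&&b~~c"))  # a\&\&b\~\~c
```

For simple sets the alternation buys nothing over merging the ranges by hand; it only becomes interesting when the operands are classes such as \w or \d that cannot be merged textually.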
Made the warning for '[' be emitted only at the start of a set. This significantly decreases the breakage of other code. I think we can get by without implicit union of nested sets, as in [[0-9][:Latin:]]. This can be written as [||[0-9]||[:Latin:]].
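A minimal sketch of how the resulting behaviour can be checked, assuming a CPython release that emits these FutureWarnings (3.7 or later). The helper name emits_future_warning and the sample patterns are hypothetical. The helper turns FutureWarning into an exception for the duration of one compile, and clears the re cache first because a cached pattern is not parsed again and would not warn a second time.

```python
import re
import warnings


def emits_future_warning(pattern):
    """Return True if compiling `pattern` triggers a FutureWarning."""
    re.purge()  # cached patterns skip the parser, so clear the cache first
    with warnings.catch_warnings():
        warnings.simplefilter("error", FutureWarning)
        try:
            re.compile(pattern)
        except FutureWarning:
            return True
    return False


samples = [
    "[[a-z]",         # '[' at the start of a set: possible nested set
    "[a[z]",          # '[' later in the set: no warning
    r"[\[a-z]",       # escaped '[' at the start: no warning
    "[a-z--aeiou]",   # possible set difference
    "[a&&b]",         # possible set intersection
    "[a||b]",         # possible set union
    "[a~~b]",         # possible set symmetric difference
    "[a-z&|~]",       # single '&', '|', '~': no warning
]
for pattern in samples:
    print(repr(pattern), emits_future_warning(pattern))
```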
It might be worth adding part of the problematic regex to the warning message. For Django's tests, I see an error like "FutureWarning: Possible nested set at position 17 return re.compile(res).match". It took some effort to track down the source. A partial traceback is:

As an aside, I'm not sure how to fix the warning in Django. It comes from the test added in django/django@98df288 where a path like 'tests/fixtures/fixtures/fixture_with[special]chars' is run through glob.escape() which creates 'tests/fixtures/fixtures/fixture_with[[]special]chars'.
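The path from glob.escape() to a compiled pattern can be reproduced with a short sketch like the one below. It only uses the path quoted above; whether the final compile step warns depends on the Python release (fnmatch.translate() was the subject of a separate issue), so the sketch records warnings instead of assuming a particular count.

```python
import fnmatch
import glob
import re
import warnings

path = "tests/fixtures/fixtures/fixture_with[special]chars"

# glob.escape() wraps each of '*', '?' and '[' in brackets.
escaped = glob.escape(path)
print(escaped)  # tests/fixtures/fixtures/fixture_with[[]special]chars

# fnmatch.translate() turns the escaped glob into a regular expression.
regex = fnmatch.translate(escaped)

re.purge()  # make sure the pattern is parsed again, not taken from the cache
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    matcher = re.compile(regex).match

print(bool(matcher(path)))  # True
# Empty on releases where fnmatch.translate() escapes the leading '['.
print([str(w.message) for w in caught])
```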
Good catch! fnmatch.translate() can produce a pattern which emits a warning when compiled. Could you please open a separate issue for this?
Okay, I created bpo-32775.
FWIW, this warning is annoying because it is hard to fix when the regexes are sourced from data: the warning message does not include the regex at fault. It should; otherwise the warning is noisy and ineffective, IMHO.
Sorry, my comment was at best nonsensical gibberish! I meant to say that this warning message should include the actual regex at fault. Otherwise it is hard to fix when the regex in question comes from some data structure such as a list: the line number where the warning occurs is not enough to locate the problem, and the code first needs to be instrumented to catch the warning, which is rather heavy-handed for handling a warning.
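The kind of instrumentation described above could look roughly like this sketch, which compiles each pattern from a list under a recording warnings filter and pairs every FutureWarning with the pattern that caused it. The function name find_warning_patterns and the sample list are hypothetical.

```python
import re
import warnings


def find_warning_patterns(patterns):
    """Return (pattern, message) pairs for patterns that emit a FutureWarning."""
    offenders = []
    for pattern in patterns:
        re.purge()  # a cached pattern would be returned without warning again
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            re.compile(pattern)
        for warning in caught:
            if issubclass(warning.category, FutureWarning):
                offenders.append((pattern, str(warning.message)))
    return offenders


# Patterns as they might arrive from a data file.
loaded = [r"\d+", r"(:[[:digit:]]+)?", "[a-z]"]
for pattern, message in find_warning_patterns(loaded):
    print(f"{pattern!r}: {message}")
```

Clearing the cache for every pattern is wasteful, but for a one-off diagnostic pass it keeps the result independent of what the process compiled earlier.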
Annoying yak shave:
- I have an ecosystem cross-reference unit test for osv.dev in the works, which relies on using the regex in the schema definition, verbatim.
- [Python 3.7 has a FutureWarning on set constructs](https://docs.python.org/dev/whatsnew/3.7.html#re), where as far as I can tell, `(:[[:digit:]]+)?` would need to be written as `(:\[[:digit:]]+)?`, which isn't valid in the JSON Schema (more in python/cpython#74534)
- https://json-schema.org/draft/2020-12/json-schema-core#name-regular-expressions says JSON Schema regex is ECMA-262, so what gets used in the schema needs to be the subset that works with https://docs.python.org/3/library/re.html
- https://xkcd.com/1171/

---------

Signed-off-by: Andrew Pollock <apollock@google.com>
Signed-off-by: Andrew Pollock <andrewpollock@users.noreply.github.com>
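For context, a small sketch of how Python's re module reads the pattern from the commit message. Python's re has no POSIX bracket classes, so `[[:digit:]]` is parsed as a set containing '[', ':' and the letters of "digit", followed by a literal ']'. The replacement `(:[0-9]+)?` is an assumption about what the schema intends (ASCII digits), not something taken from the commit.

```python
import re
import warnings

posix_style = r"(:[[:digit:]]+)?"
portable = r"(:[0-9]+)?"  # assumed intent: an optional ':' plus ASCII digits

with warnings.catch_warnings():
    # Silence the "Possible nested set" FutureWarning for this one compile.
    warnings.simplefilter("ignore", FutureWarning)
    posix_re = re.compile(posix_style)
portable_re = re.compile(portable)

print(posix_re.fullmatch(":123"))           # None: '1' is not in the set
print(bool(posix_re.fullmatch(":d]")))      # True: 'd' is in the set, then ']'
print(bool(portable_re.fullmatch(":123")))  # True
```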