Preparation for advanced set syntax in regular expressions #74534

serhiy-storchaka · 2017-05-12T08:14:13Z

BPO	30349
Nosy	@rhettinger, @ezio-melotti, @bitdancer, @serhiy-storchaka, @timgraham, @pombredanne
PRs	bpo-30349: Raise FutureWarning for nested sets and set operations #1553

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = 'https://github.com/serhiy-storchaka'
closed_at = <Date 2017-11-16.10:39:01.603>
created_at = <Date 2017-05-12.08:14:13.370>
labels = ['expert-regex', 'type-feature', 'library', '3.7']
title = 'Preparation for advanced set syntax in regular expressions'
updated_at = <Date 2021-09-21.09:51:26.713>
user = 'https://github.com/serhiy-storchaka'

bugs.python.org fields:

activity = <Date 2021-09-21.09:51:26.713>
actor = 'pombredanne'
assignee = 'serhiy.storchaka'
closed = True
closed_date = <Date 2017-11-16.10:39:01.603>
closer = 'serhiy.storchaka'
components = ['Library (Lib)', 'Regular Expressions']
creation = <Date 2017-05-12.08:14:13.370>
creator = 'serhiy.storchaka'
dependencies = []
files = []
hgrepos = []
issue_num = 30349
keywords = []
message_count = 8.0
messages = ['293532', '303757', '306349', '311682', '311684', '311688', '402299', '402303']
nosy_count = 7.0
nosy_names = ['rhettinger', 'ezio.melotti', 'mrabarnett', 'r.david.murray', 'serhiy.storchaka', 'Tim.Graham', 'pombredanne']
pr_nums = ['1553']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue30349'
versions = ['Python 3.7']

serhiy-storchaka · 2017-05-12T08:14:13Z

Currently the re module supports only simple sets. They can include literal characters, character ranges, and some simple character classes, and they support negation. The Unicode standard [1] defines set operations (union, intersection, difference and symmetric difference) and nested sets. Some regular expression engines implement these features; for example, the regex module supports all TR18 features except non-nested POSIX character classes.

If we replace the re module with the regex module, or add support for these features to the re module and enable this syntax by default, some code will break. It is very unlikely that a regular expression contains doubled characters ('--', '||', '&&' or '~~') inside a set, but nested sets use just '[', and an unescaped '[' does occur in character sets in regular expressions (even the stdlib contains several occurrences).

The proposed patch adds FutureWarnings that are emitted when a potentially breaking set construct ('--', '||', '&&', '~~' or '[') occurs in a regular expression. We need one or two releases with a warning before changing the syntax. The patch also makes re.escape() escape '&' and '~' and fixes several regular expressions in the stdlib.
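
As a rough illustration of the behavior described above, the following sketch compiles a few patterns while recording warnings and reports which ones trigger a FutureWarning. It assumes an interpreter that already includes this change (Python 3.7 or later); the helper name `future_warnings_for` is made up for this example and is not part of the patch.

```python
import re
import warnings


def future_warnings_for(pattern):
    """Compile `pattern` and return the FutureWarning messages it triggered."""
    re.purge()  # clear the compiled-pattern cache so the pattern is parsed again
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        re.compile(pattern)
    return [str(w.message) for w in caught
            if issubclass(w.category, FutureWarning)]


# Doubled operator characters inside a set, a leading '[', and a plain set.
for pattern in [r"[a-z--aeiou]", r"[a-z&&aeiou]", r"[a||b]", r"[a~~b]",
                r"[[a]", r"[a-z]"]:
    print(repr(pattern), future_warnings_for(pattern))

# re.escape() escapes '&' and '~', so escaped text cannot form '&&' or '~~'.
print(re.escape("a&&b~~c"))
```

The patterns still compile with their current meaning; only the warning is new.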

Alternatively, support for the new set syntax could be enabled by a special flag.

I'm not sure that support for set operations and nested sets is necessary. It complicates the syntax of regular expressions (which is already not simple). Currently set operations can be emulated with lookarounds:

[set1||set2] -- (?:[set1]|[set2])
[set1&&set2] -- (?=[set1])[set2]
[set1--set2] -- (?=[set1])[^set2]
[set1~~set2] -- recursively expand [[set1||set2]--[set1&&set2]]
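
A minimal sketch of these emulations with the current re module follows. The concrete sets (lowercase ASCII letters, vowels, digits) are chosen only for illustration.

```python
import re

# Union: [a-c||x-z]  ->  (?:[a-c]|[x-z])
union = re.compile(r"(?:[a-c]|[x-z])")

# Intersection: [a-z&&aeiou0-9]  ->  (?=[a-z])[aeiou0-9]
intersection = re.compile(r"(?=[a-z])[aeiou0-9]")

# Difference: [a-z--aeiou]  ->  (?=[a-z])[^aeiou]
difference = re.compile(r"(?=[a-z])[^aeiou]")

union_matches = union.findall("abc xyz m")
intersection_matches = intersection.findall("hello world 42")
difference_matches = difference.findall("hello world 42")

print(union_matches)         # ['a', 'b', 'c', 'x', 'y', 'z']
print(intersection_matches)  # ['e', 'o', 'o']
print(difference_matches)    # ['h', 'l', 'l', 'w', 'r', 'l', 'd']
```

The lookahead checks membership in the first set without consuming the character, and the following set (or negated set) then applies the second condition to that same character.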

[1] http://unicode.org/reports/tr18/#Subtraction_and_Intersection

serhiy-storchaka · 2017-10-05T10:26:53Z

Made the warning for '[' be emitted only at the start of a set. This significantly decreases the breakage of other code. I think we can get by without implicit union of nested sets, like in [[0-9][:Latin:]]. This can be written as [||[0-9]||[:Latin:]].
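
The difference between a leading '[' and a '[' elsewhere in a set can be checked with a small sketch like the one below. It assumes Python 3.7 or later, and `warns_future` is a hypothetical helper name used only here.

```python
import re
import warnings


def warns_future(pattern):
    """Return True if compiling `pattern` emits at least one FutureWarning."""
    re.purge()  # make sure the pattern is parsed rather than taken from the cache
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        re.compile(pattern)
    return any(issubclass(w.category, FutureWarning) for w in caught)


leading = warns_future(r"[[a]")  # '[' is the first character of the set
inner = warns_future(r"[a[]")    # '[' appears later in the set
escaped = warns_future(r"[\[a]") # escaping '[' is the forward-compatible spelling

print(leading, inner, escaped)
```

Escaping the bracket keeps the pattern's meaning the same today and avoids the ambiguity if nested sets are added later.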

serhiy-storchaka · 2017-11-16T10:38:34Z

New changeset 05cb728 by Serhiy Storchaka in branch 'master':
bpo-30349: Raise FutureWarning for nested sets and set operations (bpo-1553)
05cb728

timgraham · 2018-02-05T19:23:15Z

It might be worth adding part of the problematic regex to the warning message. For Django's tests, I see an error like "FutureWarning: Possible nested set at position 17 return re.compile(res).match". It took some effort to track down the source.

A partial traceback is:
  File "/home/tim/code/django/django/core/management/commands/loaddata.py", line 247, in find_fixtures
    for candidate in glob.iglob(glob.escape(path) + '*'):
  File "/home/tim/code/cpython/Lib/glob.py", line 72, in _iglob
    for name in glob_in_dir(dirname, basename, dironly):
  File "/home/tim/code/cpython/Lib/glob.py", line 83, in _glob1
    return fnmatch.filter(names, pattern)
  File "/home/tim/code/cpython/Lib/fnmatch.py", line 52, in filter
    match = _compile_pattern(pat)
  File "/home/tim/code/cpython/Lib/fnmatch.py", line 46, in _compile_pattern
    return re.compile(res).match
  File "/home/tim/code/cpython/Lib/re.py", line 240, in compile
    return _compile(pattern, flags)
  File "/home/tim/code/cpython/Lib/re.py", line 292, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/home/tim/code/cpython/Lib/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/home/tim/code/cpython/Lib/sre_parse.py", line 930, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "/home/tim/code/cpython/Lib/sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "/home/tim/code/cpython/Lib/sre_parse.py", line 816, in _parse
    p = _parse_sub(source, state, sub_verbose, nested + 1)
  File "/home/tim/code/cpython/Lib/sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "/home/tim/code/cpython/Lib/sre_parse.py", line 524, in _parse
    FutureWarning, stacklevel=nested + 6
FutureWarning: Possible nested set at position 17

As an aside, I'm not sure how to fix the warning in Django. It comes from the test added in django/django@98df288 where a path like 'tests/fixtures/fixtures/fixture_with[special]chars' is run through glob.escape() which creates 'tests/fixtures/fixtures/fixture_with[[]special]chars'.
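
The escaping step described above can be reproduced with a short sketch. It uses the path from the Django test and checks at run time whether compiling the translated pattern warns, since the result depends on whether the interpreter's fnmatch.translate() escapes '[' inside sets.

```python
import fnmatch
import glob
import re
import warnings

path = "tests/fixtures/fixtures/fixture_with[special]chars"

# glob.escape() wraps each special character in a set, so '[' becomes '[[]'.
escaped = glob.escape(path)
print(escaped)

# The escaped glob pattern still matches the literal path.
matches_itself = fnmatch.fnmatchcase(path, escaped)
print(matches_itself)

# Translate the glob to a regex and record any FutureWarning from compiling it.
regex = fnmatch.translate(escaped)
re.purge()  # force the regex to be parsed again inside the recording block
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    re.compile(regex)
nested_set_warnings = [str(w.message) for w in caught
                       if issubclass(w.category, FutureWarning)]
print(regex)
print(nested_set_warnings)
```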

serhiy-storchaka · 2018-02-05T19:43:07Z

Good catch! fnmatch.translate() can produce a pattern which emits a warning when compiled. Could you please open a separate issue for this?

timgraham · 2018-02-05T20:08:17Z

Okay, I created bpo-32775.

pombredanne · 2021-09-21T09:23:52Z

FWIW, this warning is annoying because it is hard to fix when the regexes are sourced from data: the warning message does not include the regex at fault. It should; otherwise the warning is noisy and ineffective, IMHO.

pombredanne · 2021-09-21T09:51:27Z

Sorry, my comment was at best nonsensical gibberish!

I meant to say that this warning message should include the actual regex at fault. Otherwise it is hard to fix when the regex in question comes from some data structure like a list: the line number where the warning occurs is not enough to fix the issue, and the code first needs to be instrumented to catch the warning, which is a rather heavy-handed way to handle a warning.
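
The kind of instrumentation meant here might look like the sketch below: each pattern from the data is compiled while warnings are recorded, so the offending regex can be reported alongside the message. The function name `find_warning_patterns` and the sample patterns are hypothetical, and Python 3.7 or later is assumed.

```python
import re
import warnings


def find_warning_patterns(patterns):
    """Return (pattern, message) pairs for patterns that emit a FutureWarning."""
    offenders = []
    for pattern in patterns:
        re.purge()  # a cached pattern would be returned without being re-parsed
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            re.compile(pattern)
        for w in caught:
            if issubclass(w.category, FutureWarning):
                offenders.append((pattern, str(w.message)))
    return offenders


# Patterns as they might arrive from a data structure such as a list.
patterns = [r"\d+", r"[[]special]", r"[a-z]+"]
offenders = find_warning_patterns(patterns)

for pattern, message in offenders:
    print(f"{pattern!r}: {message}")
```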

serhiy-storchaka added the 3.7 (EOL) end of life label May 12, 2017

serhiy-storchaka self-assigned this May 12, 2017

serhiy-storchaka added the stdlib (Python modules in the Lib dir), topic-regex, and type-feature (A feature request or enhancement) labels May 12, 2017

serhiy-storchaka closed this as completed Nov 16, 2017

ezio-melotti transferred this issue from another repository Apr 10, 2022
