Regex object should have introspection methods #57079

mattchaput · 2011-08-31T17:29:34Z

BPO	12870
Nosy	@ezio-melotti, @merwok

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2011-08-31.17:29:33.729>
labels = ['expert-regex', 'type-feature']
title = 'Regex object should have introspection methods'
updated_at = <Date 2017-05-06.14:52:30.063>
user = 'https://bugs.python.org/mattchaput'

bugs.python.org fields:

activity = <Date 2017-05-06.14:52:30.063>
actor = 'serhiy.storchaka'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Regular Expressions']
creation = <Date 2011-08-31.17:29:33.729>
creator = 'mattchaput'
dependencies = []
files = []
hgrepos = []
issue_num = 12870
keywords = []
message_count = 7.0
messages = ['143266', '143268', '143662', '143681', '143686', '143689', '143696']
nosy_count = 4.0
nosy_names = ['ezio.melotti', 'eric.araujo', 'mrabarnett', 'mattchaput']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'pending'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue12870'
versions = ['Python 3.4']

mattchaput · 2011-08-31T17:29:33Z

Several times in the recent past I've wished for the following methods on the regular expression object. These would allow me to speed up search and parsing code, by limiting the number of regex matches I need to try.

literal_prefix(): Returns any literal string at the start of the pattern (before any "special" parts). E.g., for the pattern "ab(c|d)ef" the method would return "ab". For the pattern "abc|def" the method would return "". When matching a regex against keys in a btree, this would let me limit the search to just the range of keys with the prefix.
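A minimal sketch of the idea, not part of the `re` module: `literal_prefix()` and `keys_matching()` below are hypothetical helpers that scan the pattern text by hand. They assume no flags are passed to `re.compile()` (a leading inline flag such as `(?i)` simply yields an empty prefix), and they are deliberately conservative, returning a shorter prefix whenever in doubt. A sorted list with `bisect` stands in for the btree.

```python
import bisect
import re

SPECIAL = set(".^$*+?{}[]\\|()")
QUANTIFIERS = set("*+?{")


def has_top_level_alternation(pattern):
    """Return True if "|" occurs outside any group or character class."""
    depth, i = 0, 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the escaped character
            continue
        if ch == "[":
            i += 1
            if pattern[i:i + 1] == "^":
                i += 1
            if pattern[i:i + 1] == "]":
                i += 1  # a leading "]" is a literal member of the class
            while i < len(pattern) and pattern[i] != "]":
                i += 2 if pattern[i] == "\\" else 1
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "|" and depth == 0:
            return True
        i += 1
    return False


def literal_prefix(pattern):
    """Hypothetical helper: literal text that every match must start with."""
    if has_top_level_alternation(pattern):
        return ""
    prefix = []
    for ch in pattern:
        if ch in SPECIAL:
            # A quantifier applies to the previous character, so that
            # character is dropped (conservative even for "+").
            if ch in QUANTIFIERS and prefix:
                prefix.pop()
            break
        prefix.append(ch)
    return "".join(prefix)


def keys_matching(sorted_keys, pattern):
    """Yield matching keys, scanning only the range that shares the prefix."""
    prefix = literal_prefix(pattern)
    rx = re.compile(pattern)
    start = bisect.bisect_left(sorted_keys, prefix)
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order: no later key can share the prefix
        if rx.match(key):
            yield key


keys = sorted(["abcef", "abdef", "abxef", "apple", "zebra"])
print(literal_prefix("ab(c|d)ef"))            # ab
print(list(keys_matching(keys, "ab(c|d)ef")))  # ['abcef', 'abdef']
```

With the prefix "ab", only three of the five keys are tested against the regex; with an empty prefix the scan degrades to testing every key.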

first_chars(): Returns a string/list/set/whatever of the possible first characters that could appear at the start of a matching string. E.g. for the pattern "ab(c|d)ef" the method would return "a". For the pattern "[a-d]ef" the method would return "abcd". When parsing a string with regexes, this would let me only have to test the regexes that could match at the current character.
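The parsing use case could look roughly like the sketch below. Everything here is hypothetical: `first_chars()` is hand-rolled rather than provided by `re`, it only understands a leading literal or a leading character class, it assumes ASCII input and no flags, and it returns `None` ("unknown, always try this regex") for anything else.

```python
import re
import string

SPECIAL = set(".^$*+?{}[]\\|()")
ALPHABET = string.printable  # assumption: input text is ASCII


def skip_class(pattern, i):
    """Given the index of "[", return the index of its closing "]"."""
    i += 1
    if pattern[i:i + 1] == "^":
        i += 1
    if pattern[i:i + 1] == "]":
        i += 1  # a leading "]" is a literal member of the class
    while i < len(pattern) and pattern[i] != "]":
        i += 2 if pattern[i] == "\\" else 1
    return i


def has_top_level_alternation(pattern):
    depth, i = 0, 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2
            continue
        if ch == "[":
            i = skip_class(pattern, i)
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "|" and depth == 0:
            return True
        i += 1
    return False


def first_chars(pattern):
    """Hypothetical helper: set of possible first characters, or None."""
    if not pattern or has_top_level_alternation(pattern):
        return None
    if pattern[0] == "[":
        end = skip_class(pattern, 0)
        if end >= len(pattern):
            return None
        head = pattern[:end + 1]
    elif pattern[0] in SPECIAL:
        return None
    else:
        head = pattern[0]
    # If the first element is optional, a later element may come first.
    if pattern[len(head):len(head) + 1] in ("*", "?", "{"):
        return None
    if len(head) == 1:
        return {head}
    cls = re.compile(head)
    return {c for c in ALPHABET if cls.fullmatch(c)}


def build_dispatch(patterns):
    """Group compiled regexes by the characters they can start with."""
    by_char, always = {}, []
    for pattern in patterns:
        rx = re.compile(pattern)
        chars = first_chars(pattern)
        if chars is None:
            always.append(rx)
        else:
            for c in chars:
                by_char.setdefault(c, []).append(rx)
    return by_char, always


def candidates(by_char, always, text, pos):
    """Regexes worth trying at text[pos]; all others cannot match there."""
    return by_char.get(text[pos:pos + 1], []) + always


by_char, always = build_dispatch(["ab(c|d)ef", "[a-d]ef", "x*y"])
print([rx.pattern for rx in candidates(by_char, always, "bef", 0)])
```

Note that this changes the order in which regexes are tried (the "always" group goes last), which matters if a parser relies on first-match-wins ordering.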

As long as you're making a new regex package, I thought I'd put in a request for these :)

ezio-melotti · 2011-08-31T17:54:32Z

These additions sound more useful as an external tool than as regex functions/methods. There are already a few tools able to "explain" what a regex matches.
The use cases you proposed are too specific to deserve new methods, and using them programmatically sounds like premature optimization IMHO.

mattchaput · 2011-09-07T05:03:04Z

Ezio, no offense, but I think it's safe to say you've completely misunderstood this bug. It is not about "explaining what a regex matches" or optimizing the regex. Read the last sentences of the two paragraphs explaining the proposed methods for the use cases. This is about allowing MY CODE to programmatically get certain information about a regex object to allow it to limit the number of times it has to call regex.match(). AFAIK there's no good way to get this information about a regex object without adding these methods or building my own pure-Python regex interpreter, which would be both Herculean and pointless.

ezio-melotti · 2011-09-07T14:34:21Z

Limiting the number of calls to re.match sounds like an optimization to me, and I still think that the methods you proposed are too specific.

merwok · 2011-09-07T14:42:38Z

I tend to agree with Ezio. Matt, maybe you could ask for other opinions on python-ideas?

mattchaput · 2011-09-07T15:22:29Z

Yes, it's an optimization of my code, not the regex, as I said. Believe me, it's not premature. I've listed two general use cases for the two methods. To me it seems obvious that having to test a large number of regexes against a string, and having to test a single regex against a large number of strings, are two very common programming tasks, and they could both be speeded up quite a bit using these methods.

As of now my parsing code and other code such as PyParsing are resorting to hacks like requiring the user to manually specify the possible first chars of a regex at configuration. With the hacks, the code can be hundreds of times faster. But the hacks are error-prone and should be unnecessary.

The PCRE library implements at least the "first char" functionality, and a lot more regex introspection that would be useful, through its pcre_fullinfo() function.

ezio-melotti · 2011-09-07T16:32:35Z

If there is a generic introspection method like the pcre_fullinfo you mentioned, and if it's also useful and used with other languages/libraries, then it might be considered.

serhiy-storchaka · 2022-04-18T14:36:42Z

This is an old issue, and I agree with @ezio-melotti that this request does not look very useful outside your specific program.

It's also not very well thought out:

What should first_chars() return for [^a]? A set of sys.maxunicode characters?
What should first_chars() return for .(?<=a)? What if the re module compiler becomes so advanced in future Python versions that it is able to optimize it to a? The result would depend on the optimization level.
What should literal_prefix() return for apple|application?
What should literal_prefix() return for (?i)33 cows?

There are thousands of such questions, and different answers can make sense for different applications.
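A few of these questions can be made concrete with the existing `re` module. The snippet below is only an illustration of why the answers are ambiguous; the variable names are made up, and the ASCII-only sample is an assumption chosen to keep the check small.

```python
import os
import re
import sys

# apple|application: "" and "appl" are both defensible answers.
branches = "apple|application".split("|")
common = os.path.commonprefix(branches)
print(common)  # appl

# (?i)33 cows: the digits and the space are case-invariant, but "cows"
# is not, so returning "33 cows" as a literal prefix would be wrong.
rx = re.compile("(?i)33 cows")
print(rx.match("33 COWS") is not None)  # True

# [^a]: even within ASCII the set of first characters is nearly everything.
negated = re.compile("[^a]")
ascii_count = sum(1 for cp in range(0x80) if negated.fullmatch(chr(cp)))
print(ascii_count, "of 128 ASCII characters; sys.maxunicode is", sys.maxunicode)
```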

mattchaput mannequin added the topic-regex and type-feature labels Aug 31, 2011

ezio-melotti transferred this issue from another repository Apr 10, 2022

serhiy-storchaka added the pending label Apr 18, 2022

iritkatriel closed this as completed May 15, 2022

AlexWaygood removed the pending label May 15, 2022
