Removed dependency on re.Scanner for compatibility with IronPython. #465
Conversation
Interesting. Didn't know re.Scanner wasn't available in IronPython. Would be interested in some benchmarks. How does this perform compared to re.Scanner on CPython? However, I'm still hesitant to accept this. I trust that re.Scanner is relatively bug free and I don't have to maintain it. However, this is an unknown and I would have to maintain it long term. Seems to me that if re.Scanner was that easy to re-implement, then IronPython should be doing that (I realize this might be a small subset -- but I haven't checked). After all, I would prefer to defer this upstream. And given that this is all in an extension, there is no reason you can't fork the extension (within the limits of the license) and maintain it for yourself and other IronPython users. In fact, in version 3.0 (coming soonish), all extensions will be separate packages.
Oh and all other concerns aside, you would also need to address the flake8 errors raised in the tests.
Yes, I have to agree that this is really an IronPython issue. The issue is described here and is still open. From what I can understand, a significant part of the re module is implemented in C for CPython and in .NET for IronPython, which probably explains the differences. I've had a look at the code for re.Scanner in CPython; I don't think I will be able to recreate it myself, and simply copying the CPython code results in an exception.

I have made a little script to benchmark the scanners. It looks like the simplified class is much quicker to create (~95%) and the scan method is a bit faster (~0-15%). The difference in the scan method seems to change run by run and I do not know what is causing this. Of course, the scan method is the important part, as it can get called multiple times. I'm not sure how much scope there is for optimization; all I can say is that so far I have made no effort in that direction. Whether this sort of performance improvement is worth the maintenance I am not sure, so forking the extension might be best. The results I get from the benchmark script are as follows:
The script is shown below:

```python
import re
from timeit import default_timer as timer


### Internal Scanner

# Simplified replacement for the re.Scanner class that appears in CPython,
# but not in IronPython.
class InternalScanner:
    # * Create a regex string that matches each phrase as a separate group.
    # * Create an array of action functions.
    # * The index of each phrase's match group is equal to the index of the
    #   corresponding action in the action array.
    def __init__(self, lexicon):
        self.actions = []
        regex_string = ""
        for (phrase, action) in lexicon:
            if regex_string != "":
                regex_string += "|"
            regex_string += "(" + phrase + ")"
            self.actions.append(action)
        self.regex = re.compile(regex_string)

    # For each match found:
    # * find the match group,
    # * perform the corresponding action,
    # * and add the result to the results list.
    def scan(self, string):
        result = []
        for match in self.regex.finditer(string):
            num = -1
            for group in match.groups():
                num += 1
                if group is not None and self.actions[num] is not None:
                    result.append(self.actions[num](group))
        return result


### Helper functions

# Instantiate a scanner from the stdlib with phrases as defined in markdown.
def CreateStdlibScanner():
    try:
        Scanner = re.Scanner
    except AttributeError:  # pragma: no cover
        # must be on Python 2.4
        from sre import Scanner

    def _handle_double_quote(s, t):
        k, v = t.split('=')
        return k, v.strip('"')

    def _handle_single_quote(s, t):
        k, v = t.split('=')
        return k, v.strip("'")

    def _handle_key_value(s, t):
        return t.split('=')

    def _handle_word(s, t):
        if t.startswith('.'):
            return '.', t[1:]
        if t.startswith('#'):
            return 'id', t[1:]
        return t, t

    return Scanner([
        (r'[^ ]+=".*?"', _handle_double_quote),
        (r"[^ ]+='.*?'", _handle_single_quote),
        (r'[^ ]+=[^ =]+', _handle_key_value),
        (r'[^ =]+', _handle_word),
        (r' ', None)
    ])


# Instantiate a scanner from the internal class with phrases as defined in markdown.
def CreateInternalScanner():
    def _handle_double_quote(t):
        k, v = t.split('=')
        return k, v.strip('"')

    def _handle_single_quote(t):
        k, v = t.split('=')
        return k, v.strip("'")

    def _handle_key_value(t):
        return t.split('=')

    def _handle_word(t):
        if t.startswith('.'):
            return '.', t[1:]
        if t.startswith('#'):
            return 'id', t[1:]
        return t, t

    return InternalScanner([
        (r'[^ ]+=".*?"', _handle_double_quote),
        (r"[^ ]+='.*?'", _handle_single_quote),
        (r'[^ ]+=[^ =]+', _handle_key_value),
        (r'\=[^ ]+', None),  # Ignore the case '=foo'.
        (r'[^ =]+', _handle_word),
        (r' ', None)
    ])


# Time how long `times` runs of `test_func` take.
def benchmark(test_func, times):
    start = timer()
    for n in range(times):
        test_func()
    end = timer()
    return end - start


# Write a string comparing the results.
def write_winner(stdlib_time, internal_time):
    if stdlib_time < internal_time:
        print "StdLib wins! It is " + str(round((internal_time - stdlib_time) / stdlib_time * 100)) + "% faster."
    if internal_time < stdlib_time:
        print "Internal wins! It is " + str(round((stdlib_time - internal_time) / stdlib_time * 100)) + "% faster."


### Benchmarks

# Print results from the constructor benchmarks.
def time_ctor(number):
    # Run benchmarks.
    stdlib_time = benchmark(CreateStdlibScanner, number)
    internal_time = benchmark(CreateInternalScanner, number)
    # Output results.
    print "Time for " + str(number) + " runs of constructor."
    print " * Stdlib   : " + str(stdlib_time)
    print " * Internal : " + str(internal_time)
    write_winner(stdlib_time, internal_time)


# Print results from the scan method benchmarks.
def time_scan(test_string, number):
    # Create scanners.
    stdlib_scanner = CreateStdlibScanner()
    internal_scanner = CreateInternalScanner()
    # Run benchmarks.
    stdlib_time = benchmark(lambda: stdlib_scanner.scan(test_string), number)
    internal_time = benchmark(lambda: internal_scanner.scan(test_string), number)
    # Output results.
    print "Time for " + str(number) + " runs of .scan method on string:\n '" + test_string + "'"
    print " * Stdlib   : " + str(stdlib_time) + " s"
    print " * Internal : " + str(internal_time) + " s"
    write_winner(stdlib_time, internal_time)


# Run the benchmarking functions.
time_ctor(200)
print "\n"
time_scan('#id1 .class1 id=id2 class="class2 class3" .class4', 10000)
print "\n"
```
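As a quick sanity check (not part of the benchmark script above), the same helpers can be used to confirm that the two scanners produce identical tokens for the benchmark string. Note that re.Scanner.scan returns a (tokens, remainder) tuple, while the simplified class returns only the token list:

```python
# Equivalence check using CreateStdlibScanner/CreateInternalScanner from
# the benchmark script above (run in the same module).
s = '#id1 .class1 id=id2 class="class2 class3" .class4'

stdlib_tokens, remainder = CreateStdlibScanner().scan(s)  # (tokens, leftover text)
internal_tokens = CreateInternalScanner().scan(s)         # tokens only

print(stdlib_tokens == internal_tokens)  # expected: True
print(remainder == '')                   # expected: True
```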
Hmm, the performance improvement is interesting. I didn't really expect that. However, it is not enough to sway me. This should either be addressed in a fork of the extension or by IronPython itself. If a fork of the extension ends up becoming wildly popular, then I may reconsider at that time.
Hi, I am trying to get python-markdown working on IronPython. One issue is the dependency on re.Scanner in the attribute list extension, as this is not implemented in IronPython. I have removed this dependency and replaced it with a simplified class defined within the file.
I have tried copying CPython's re.Scanner class, but it does not play nicely with IronPython. Therefore, I have implemented a class that tries to achieve the same as re.Scanner, which I have defined in attr_list.py. I have had to add another phrase-action pair so that the case '=foo' is ignored by the new class.
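To illustrate the '=foo' case (a minimal sketch, not code from attr_list.py): re.Scanner stops scanning at the first character that no phrase matches, so '=foo' yields no tokens at all, whereas finditer() simply skips unmatched characters and would otherwise turn it into a word token. The extra phrase consumes '=foo' whole and discards it:

```python
import re

# Without the ignore rule, the word phrase alone would still pick up 'foo',
# because finditer() skips the unmatched '='...
word = re.compile(r'[^ =]+')
print(word.findall('=foo'))  # ['foo']

# ...while placing (r'\=[^ ]+', None) before it consumes '=foo' as a whole,
# so nothing is emitted for it, matching what re.Scanner produces here.
with_ignore = re.compile(r'\=[^ ]+|([^ =]+)')
print([m.group(1) for m in with_ignore.finditer('=foo') if m.group(1)])  # []
```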
I have also removed the scanner arguments from the action functions, as I do not know what these are for and my replacement class does not use them.
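For reference, a small standalone sketch (not taken from the patch) of what that first argument is: re.Scanner passes the Scanner instance itself to each action, so a handler could inspect scanner.match, while the replacement class passes only the matched text:

```python
import re


def stdlib_action(scanner, token):
    # 'scanner' is the re.Scanner instance; scanner.match holds the current
    # match object. The attr_list handlers never use it.
    return token.upper()


def internal_action(token):
    # The replacement class calls each action with just the matched text.
    return token.upper()


tokens, rest = re.Scanner([(r'\w+', stdlib_action), (r'\s+', None)]).scan('a bc')
print(tokens)                 # ['A', 'BC']
print(internal_action('bc'))  # 'BC'
```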
I cannot see any other differences in behavior from the original. All tests pass using CPython, the number of failures in IronPython has gone from 23 to 9, and none of those are caused by attr_list.py.
Cheers