IDLE: pyparse - simplify StringTranslatePseudoMapping #77121
Based on the timing test in msg312733, StringTranslatePseudoMapping and a defaultdict have about the same performance, so this replaces the custom class with stdlib functionality.
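For illustration, a minimal sketch of what a defaultdict-based translation map could look like (the variable name and sample string are illustrative, not the PR's actual code):

from collections import defaultdict

# Unmapped characters default to 'x'; the characters pyparse cares about
# keep their structural meaning.
tran = defaultdict(lambda: 'x')
tran.update((ord(c), '(') for c in "({[")
tran.update((ord(c), ')') for c in ")}]")
tran.update((ord(c), c) for c in "\"'\\\n#")

print('if a[1:]:\n    pass\n'.translate(tran))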
Tal had written this on the original bpo-21765:
I forgot about this defaultdict behavior: "this value is inserted in the dictionary for the key, and returned." Reason: when default_factory returns a mutable, d[key] must return the same possibly mutated object with each call. I agree that defaultdict is not the right replacement.

We need to pass to str.translate a dict that can be used by subscripting, newchar = d[char]. So partial(non-defaults.get, default_value) will not work. Instead, we need a __getitem__ that returns the same.

In msg312444 I suggested simplifying STPM (including the name) because it has unneeded complexity. Remove the buggy .get override. Combine the _get stuff in __init__ (also removed) with the current __getitem__, simplify, and we get what we actually need (untested, as yet):

def __getitem__

Actually, we could hard-code the default value as 'X' as we never need anything else. How about ParseMap for the name?
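For illustration (not part of the original message), the quoted defaultdict behavior, and why it matters here, can be seen directly:

from collections import defaultdict

d = defaultdict(lambda: 'x')
d[ord('q')]                    # merely looking up a missing key inserts the default...
print(ord('q') in d, len(d))   # True 1 -- the mapping grows with every new character seen

A __getitem__ (or __missing__) override that simply returns 'x' avoids this growth while still satisfying the newchar = d[char] lookups that str.translate performs.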
New PR submitted. Really glad I worked on this today. I learned a lot of things that were new to me. :-)
I simplified ParseMap to a dict subclass with one override -- __getitem__ -- and then simplified the tests. They run faster. I suspect translate is faster.
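A minimal sketch of a one-override subclass along these lines, with a made-up usage example (not the committed pyparse code):

class ParseMap(dict):
    # Anything not explicitly mapped translates to 'x' (120), and the
    # default is never stored back into the dict.
    def __getitem__(self, key):
        return self.get(key, ord('x'))

trans = ParseMap((ord(c), ord('(')) for c in "({[")
trans.update((ord(c), ord(')')) for c in ")}]")
trans.update((ord(c), ord(c)) for c in "\"'\\\n#")

# Brackets are normalized to ( and ); quotes, backslash, newline and '#'
# are kept; everything else becomes 'x'.
print('if a[1:]:\n    pass\n'.translate(trans))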
For efficiency I suggest initializing the mapping with dict.fromkeys(range(128), 'x') rather than with an empty dict.

It is also possible to use regular expressions:

_trans = re.compile(r'''[^(){}\[\]"'\\\n#]+''')
code = _trans.sub('x', code)
code = code.replace('{', '(')
code = code.replace('}', ')')
code = code.replace('[', '(')
code = code.replace(']', ')')
code = code.replace('\nx', '\n')

I didn't check which way is more efficient.
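A self-contained version of this regex approach, for concreteness (the remap function name and the sample string are mine, not from the message):

import re

_trans = re.compile(r'''[^(){}\[\]"'\\\n#]+''')

def remap(code):
    # Collapse runs of "uninteresting" characters to 'x', then normalize brackets.
    code = _trans.sub('x', code)
    code = code.replace('{', '(').replace('}', ')')
    code = code.replace('[', '(').replace(']', ')')
    return code.replace('\nx', '\n')

print(repr(remap('if a[1:]:\n    pass\n')))   # 'x(x)x\n\n'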
To me, it is plausible but not slam-dunk obvious that preloading ascii-to-'x' mappings will make ascii lookup faster. On bpo-21765, where the pyparse special translation was a side issue, Tal Einat claimed that the unpublished regex he tried was 100x slower.
A similar regular expression version was mentioned on bpo-21765 and I had run some tests on it yesterday to verify. On my system, it ran at a factor of 10x slower, so if the translate finished in 0.003, the regex took 0.03. This was consistent for me, regardless of how big I made the document.

The reason for not using a defaultdict was to keep the 'x' mappings out of the dictionary so that it wouldn't grow and take up space. Although, I did realize yesterday that it wasn't really boundless, because most values in source code would be ASCII. Running both the version that doesn't add the 'x' mappings and the
I settled on the following to compare ParseMap implementations.

from idlelib.pyparse import Parser
import timeit
class ParseGet(dict):
    def __getitem__(self, key): return self.get(key, ord('x'))

class ParseMis(dict):
    def __missing__(self, key): return ord('x')
for P in (ParseGet, ParseMis):
    print(P.__name__, 'hit', 'miss')
    p = P({i: i for i in (10, 34, 35, 39, 40, 41, 91, 92, 93, 123, 125)})
    print(timeit.timeit(
        "p[10],p[34],p[35],p[39],p[40],p[41],p[91],p[92],p[93],p[125]",
        number=100000, globals=globals()))
    print(timeit.timeit(
        "p[11],p[33],p[36],p[45],p[50],p[61],p[71],p[82],p[99],p[125]",
        number=100000, globals=globals()))

ParseGet hit miss
ParseMis hit miss

Avoiding custom code for all ascii chars will be a win. I am sure that calling __missing__ for non-ascii will be at least as fast as it is presently. I will commit a revision tomorrow. I may then compare to Serhiy's sub/replace suggestion.

My experiments with 'code.translate(tran)' indicate that time grows sub-linearly up to 1000 or 10000 chars. This suggests that there are significant constant or log-like terms.
Don't use ord('x'). Use just 'x' or b'x'[0].
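For context on this suggestion: indexing a bytes object yields an int, so the two spellings denote the same value, while 'x' is the one-character string form.

print(ord('x') == b'x'[0] == 120)   # True -- ord('x') and b'x'[0] are the same int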
Replacing an expression with a less clear equivalent expression makes no sense to me. Anyway, having __missing__ return 120 reduces the benchmark miss time from 1.2-1.3 to .93, making ParseMis always faster than ParseGet and reducing the penalty for non-ascii chars.

re.sub + str.replace is slower than translate:

import re
import timeit
class ParseMap(dict):
    def __missing__(self, key): return 120  # ord('x')
trans = ParseMap((i,120) for i in range(128))
trans.update((ord(c), ord('(')) for c in "({[")
trans.update((ord(c), ord(')')) for c in ")}]")
trans.update((ord(c), ord(c)) for c in "\"'\\\n#")
trans_re = re.compile(r'''[^(){}\[\]"'\\\n#]+''')
code = '\t a([{b}])b"c\'d\n' * 1000  # n = 1, 10, 100, 1000
print(timeit.timeit(

n  trans  re

Multiply by 100 to get microseconds, or seconds for 1000000.
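The timing call above was cut off in the migrated text; here is a hedged, self-contained reconstruction of such a comparison, following the timing pattern used elsewhere in this thread (the re_remap helper and the exact loop are my own guesses, not the original code):

import re
import timeit

class ParseMap(dict):
    def __missing__(self, key): return 120   # ord('x')

trans = ParseMap((i, 120) for i in range(128))
trans.update((ord(c), ord('(')) for c in "({[")
trans.update((ord(c), ord(')')) for c in ")}]")
trans.update((ord(c), ord(c)) for c in "\"'\\\n#")

trans_re = re.compile(r'''[^(){}\[\]"'\\\n#]+''')

def re_remap(code):
    code = trans_re.sub('x', code)
    code = code.replace('{', '(').replace('}', ')')
    code = code.replace('[', '(').replace(']', ')')
    return code.replace('\nx', '\n')

for n in (1, 10, 100, 1000):
    code = '\t a([{b}])b"c\'d\n' * n
    t_translate = timeit.timeit('code.translate(trans)',
                                number=10000, globals=globals())
    t_regex = timeit.timeit('re_remap(code)',
                            number=10000, globals=globals())
    print(n, t_translate, t_regex)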
The mapping passed to str.translate must map ints representing codepoints to either ints or strings. Translate can extract the codepoints for the new string from either. Ints are slightly faster, so I am inclined not to switch.

import timeit
class ParseMapN(dict):
    def __missing__(self, key): return 120

class ParseMapS(dict):
    def __missing__(self, key): return 'x'
trans1 = ParseMapN.fromkeys(range(128), 120)
trans1.update((ord(c), ord('(')) for c in "({[")
trans1.update((ord(c), ord(')')) for c in ")}]")
trans1.update((ord(c), ord(c)) for c in "\"'\\\n#")
trans2 = ParseMapS.fromkeys(range(128), 'x')
trans2.update((ord(c), '(') for c in "({[")
trans2.update((ord(c), ')') for c in ")}]")
trans2.update((ord(c), c) for c in "\"'\\\n#")
for i in (1, 10, 100, 1000):
    code = '\t a([{b}])b"c\'d\n' * i
    print('N ', i)
    print(timeit.repeat(
        'code.translate(trans1)',
        number=10000, globals=globals()))
    print('S ', i)
    print(timeit.repeat(
        'code.translate(trans2)',
        number=10000, globals=globals()))
N 1 [0.056562504, 0.056747570, 0.05654651, 0.056460749, 0.056428776]
S 1 [0.060795346, 0.062304155, 0.061043432, 0.062349345, 0.061191301]
N 10 [0.076474600, 0.076463227, 0.076560984, 0.076581582, 0.076010091]
N 100 [0.28529922, 0.28383868, 0.283949046, 0.284461512, 0.284291203]
N 1000 [2.23882157, 2.2383192, 2.2384120, 2.2377972, 2.23854982]

The pattern of all S repeats being greater than all corresponding N repeats was true for 2 other runs.
Is there more work to be done here, or can we close this issue?
As near as I can tell now, I tested all proposed alternatives before committing. So all done.