Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize amz_norm_whitespace #35

Merged
merged 1 commit into from
Mar 4, 2021
Merged

Optimize amz_norm_whitespace #35

merged 1 commit into from
Mar 4, 2021

Conversation

noamkush
Copy link
Contributor

While profiling I noticed that requests-aws4auth spends 2/3 of the time in amz_norm_whitespace while many of the strings that are passed to it don't contain any whitespace. Checking for whitespace using a regex before trying to replace it turned out to be much faster than running shlex every time.

@tedder
Copy link
Owner

tedder commented Mar 3, 2021

I did some testing, and I'm going to reject this because it is much worse in some scenarios.

#!/usr/bin/python3

import timeit


def run(run_count=10000, _repeat=1):
  print(run_count, _repeat)
  test_str = 'Vice lyft pitchfork, poke wayfarers flexitarian lumbersexual taxidermy twee bespoke la croix listicle. Tote bag cloud bread XOXO yuccie crucifix, fam messenger bag letterpress cronut subway tile leggings small batch. Freegan shaman tattooed, gluten-free pinterest chambray whatever. Put a bird on it small batch franzen, readymade cardigan four dollar toast marfa heirloom cold-pressed YOLO.'
  t = timeit.repeat(stmt='''
if re.search('\s', text):
  f = ' '.join(shlex.split(text, posix=False))
''', setup=f'''
import re
import shlex
text = '{test_str}' * {_repeat}
''', number=run_count, repeat=3)
  print(f"   spaces, re and shlex: {t}")
  
  t = timeit.repeat(stmt='''
f = ' '.join(shlex.split(text, posix=False))
''', setup=f'''
import shlex
text = '{test_str}' * {_repeat}
''', number=run_count, repeat=3)
  print(f"   spaces,        shlex: {t}")
  t = timeit.repeat(stmt='''
if re.search('\s', text):
  f = ' '.join(shlex.split(text, posix=False))
''', setup=f'''
import re
import shlex
text = '{test_str}'.replace(' ', '') * {_repeat}
''', number=run_count, repeat=3)
  print(f"no spaces, re and shlex: {t}")
  
  t = timeit.repeat(stmt='''
f = ' '.join(shlex.split(text, posix=False))
''', setup=f'''
import shlex
text = '{test_str}'.replace(' ', '') * {_repeat}
''', number=run_count, repeat=3)
  print(f"no spaces,        shlex: {t}")

run(_repeat=1)
print("*** again with length multiplier")
run(500, 1024)
$ python3 rex_time.py 
10000 1
   spaces, re and shlex: [2.3800748139619827, 2.653704177006148, 2.585710724000819]
   spaces,        shlex: [2.2576327560236678, 2.394987210049294, 2.31907087599393]
no spaces, re and shlex: [0.01793892600107938, 0.017424626974388957, 0.01678658206947148]
no spaces,        shlex: [2.1401367279468104, 1.9101750730769709, 1.9226093819597736]
*** again with length multiplier
500 1024
   spaces, re and shlex: [118.54910042299889, 121.399052057066, 122.54422497900669]
   spaces,        shlex: [121.31505271594506, 122.77670257701539, 129.68882209795993]
no spaces, re and shlex: [0.659932279959321, 0.6706405270379037, 0.6688751910114661]
no spaces,        shlex: [1140.910614454071, 1158.019293964957, 1188.4628928540042]

For relatively short strings (and high iterations) the overhead is very similar for "shlex with regex" versus "shlex without regex". There's one exception that is very fast: strings with zero spaces. But the other three are very close. Removing the regex generally speeds things up by 5-20%.

Now, when it gets to a long string with spaces, the results are similar to the short strings. (removing the regex actually slows it down slightly). But long strings without spaces is what seals the deal: shlex is drastically slower.

So, TLDR, removing the regex speeds up the best cases by 11%. But it slows down the worst cases by a factor of 10 compared to long strings with spaces. Like many algorithms, the worst-case is the biggest problem.

@tedder tedder closed this Mar 3, 2021
@noamkush
Copy link
Contributor Author

noamkush commented Mar 4, 2021

Huh? According to your own numbers without whitespace my version is 117 to 1600 times faster (this is not a percentage, it's a multiplier) and practically the same with whitespace. You said

shlex is drastically slower.

So why are you not merging this? In what scenario was it much worse?

@tedder tedder reopened this Mar 4, 2021
@tedder
Copy link
Owner

tedder commented Mar 4, 2021

Aha- I misremembered the diff. I thought you were removing the regex.

@tedder tedder merged commit 31c09f8 into tedder:master Mar 4, 2021
tedder added a commit that referenced this pull request Mar 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants