Cache unified_diff function #506
Conversation
Will this do the right thing? Is the parameter hashable, so that `lru_cache` does the right thing? Is there a performance issue with `unified_diff()` that requires us to cache the data?

Of course it isn't hashable, so it's doing nothing! Yes, building in-code caching of the diff results is the only solution; any suggestions on how to architect it (there is so much nesting...)?
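For context on the hashability question above: `functools.lru_cache` builds its cache key by hashing the arguments, so every argument must be hashable. In current CPython, passing an unhashable argument (such as a list) makes the cached call raise `TypeError` outright. A minimal sketch:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def total(items):
    # items becomes part of the cache key, so it must be hashable
    return sum(items)

print(total((1, 2, 3)))  # tuple is hashable: works, prints 6

try:
    total([1, 2, 3])  # list is unhashable
except TypeError as e:
    print('TypeError:', e)
```

This is why caching only starts paying off once the wrapped function is called with hashable arguments such as strings or tuples.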
Did you compare the performance of If
Something like this:
Of course, you could also encapsulate this into a
Thanks! I was mentally stuck; it all makes sense. Turning the function call into a cached global function did the trick:
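The actual patch isn't quoted above, but a module-level cached wrapper along these lines is one way to realize the "cached global function" described here (the function and parameter names are illustrative, not necessarily the PR's exact code):

```python
import difflib
from functools import lru_cache

@lru_cache(maxsize=None)
def cached_unified_diff(old_data, new_data, fromfile, tofile):
    # All arguments are strings, hence hashable; splitting into lines
    # happens inside the wrapper so no unhashable list crosses the
    # cache boundary.
    return '\n'.join(
        difflib.unified_diff(
            old_data.splitlines(), new_data.splitlines(),
            fromfile=fromfile, tofile=tofile, lineterm=''
        )
    )

diff = cached_unified_diff('a\nb\n', 'a\nc\n', 'old', 'new')
# A second identical call is answered from the cache, not recomputed.
same = cached_unified_diff('a\nb\n', 'a\nc\n', 'old', 'new')
```

Because the wrapper is a plain module-level function, every reporter (text body, HTML body, console) that asks for the same diff shares one cached result.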
Good to go!
Thanks for the update. Needs rebasing, as there are merge conflicts now. Have you measured and compared the performance of using the cache and not using the cache?
Have now, and concluded once again that memoization works! Here you go:

Before:

```yaml
name: 2020 in science
url: https://en.wikipedia.org/wiki/2020_in_science
filter:
  - html2text:
      method: pyhtml2text
      unicode_snob: true
      body_width: 0
      single_line_break: true
      ignore_images: true
  - grepi: 10 July – Astronomers announce the discovery of the
  - grepi: '892. /*/*'
  - grepi: '893. /*/*'
  - grepi: '894. /*/*'
```

After:

```yaml
name: 2020 in science
url: https://en.wikipedia.org/wiki/2020_in_science
filter:
  - html2text:
      method: pyhtml2text
      unicode_snob: true
      body_width: 0
      single_line_break: true
      ignore_images: true
```

Timed with:

```python
from timeit import default_timer

start = default_timer()
txt = cached_unified_diff(job_state.old_data, job_state.new_data, timestamp_old, timestamp_new, job_state.job.comparison_filter)
print(f'unified_diff duration {(default_timer() - start) * 1e3:.2f} milliseconds')
```

The results are:

With caching turned on (average of 3 runs):

Without caching (i.e. removed

There's some jitter in the numbers, so the 6.88 of the first runs and the 7.02 of the second runs are comparable.

TOTALS: caching reduces total time from 19 ms to 7 ms in this one real-world example. The improvement will of course be larger when additional reporters are enabled (e.g. slack in addition to email) or when more complex corpuses and/or changes are being diffed.

P.S. Interestingly (to me), there also appears to be some built-in memoization in Python, as with caching turned off, calling the same function with the same parameters again still reduces its execution time, as seen from the first set of numbers.
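One way to confirm which calls actually hit the cache, rather than inferring it from timing jitter, is `lru_cache`'s `cache_info()` counters. A small sketch:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def double(x):
    return x * 2

double(21)  # miss: computed
double(21)  # hit: served from the cache
double(7)   # miss: new argument
info = double.cache_info()
print(info)  # CacheInfo(hits=1, misses=2, maxsize=None, currsize=2)
```

Checking `hits`/`misses` this way separates genuine memoization from other warm-up effects (OS file caches, interpreter warm-up) that can also make repeated runs faster.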
Rebased. This is the last work I'll do on this improvement (other than forking the project if needed).
Closing in favor of #527, let's continue the discussion there. |
When sending HTML email, the `unified_diff` function is called twice: once to generate `body_text`, and then again when generating `body_html`, since the output from `body_text` is not used to generate `body_html`. Furthermore, if output to the console is enabled, the function is called a third time.

While refactoring the code to avoid such waste would be ideal, this one-liner should speed things up by itself.