
Cache unified_diff function #506

Closed
wants to merge 5 commits into from

Conversation

mborsetti
Contributor

When sending HTML email, the unified_diff function is called twice: once to generate body_text, and again to generate body_html, since the output of body_text is not reused for body_html. Furthermore, if output to the console is enabled, the function is called a third time.

While refactoring the code to avoid this waste would be ideal, this one-liner should speed things up by itself.
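
For illustration, a minimal sketch of the idea, wrapping the diff helper in functools.lru_cache so that repeated calls with identical inputs reuse one result (the helper name and signature here are assumptions for this example, not the exact diff in this PR):

import difflib
import functools

@functools.lru_cache(maxsize=1)  # the proposed one-liner: memoize the latest diff
def unified_diff(old_data, new_data):
    # difflib only runs on a cache miss; body_text, body_html and the
    # console output can then share the same computed result.
    return ''.join(difflib.unified_diff(old_data.splitlines(keepends=True),
                                        new_data.splitlines(keepends=True)))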

@thp thp left a comment
Owner


Will this do the right thing? Is the parameter hashable so that lru_cache does the right thing? Is there a performance issue with unified_diff() that requires us to cache the data?

@mborsetti
Contributor Author

Of course it isn't hashable, so it's doing nothing!

Yes, difflib.unified_diff() can be extremely costly (on the order of several seconds) on huge corpora, and everything helps.

Building in-code caching of the diff results is the only solution; any suggestions on how to architect it (there is so much nesting...)?
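
(A quick standalone illustration of the hashability point, not from the PR: functools.lru_cache keys its cache on the call arguments, so hashable arguments get cached, while an unhashable argument raises TypeError at call time.)

import functools

@functools.lru_cache(maxsize=1)
def diff_length(old, new):
    print('computing...')  # printed only on a cache miss
    return abs(len(old) - len(new))

diff_length('a', 'bbb')   # prints 'computing...'; result is cached
diff_length('a', 'bbb')   # cache hit: nothing printed
try:
    diff_length(['a'], ['bbb'])  # lists are unhashable
except TypeError as e:
    print(e)              # unhashable type: 'list'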

@thp
Owner

thp commented Jul 10, 2020

Did you compare the performance of difflib.unified_diff() with using e.g. an external diff tool (diff -u)?
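
(For reference, such a comparison could shell out to an external tool roughly like this; an illustrative sketch only. GNU diff exits with status 1 when the inputs differ, so the return code must not be treated as an error, and since difflib works on in-memory strings, the data would first have to be written to temporary files.)

import subprocess

def external_unified_diff(old_path, new_path):
    # GNU diff returns exit status 1 when the files differ,
    # so don't raise on a nonzero return code here.
    result = subprocess.run(['diff', '-u', old_path, new_path],
                            capture_output=True, text=True)
    return result.stdout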

If difflib.unified_diff() is the problem, then you could move this block into its own global function:

        return ''.join(difflib.unified_diff(job_state.old_data.splitlines(keepends=True),
                                            job_state.new_data.splitlines(keepends=True),
                                            '@', '@', timestamp_old, timestamp_new))

Something like this:

import difflib
import functools

@functools.lru_cache(maxsize=1)
def _unified_diff(old_data, new_data, timestamp_old, timestamp_new):
    return ''.join(difflib.unified_diff(old_data.splitlines(keepends=True),
                                        new_data.splitlines(keepends=True),
                                        '@', '@', timestamp_old, timestamp_new))

Of course, you could also encapsulate this into a DiffResult object that gets created based on the job_state, and pass this diff result around to every code part that needs it (or implement lazy evaluation as part of the job, i.e. a job.get_diff_result() that calculates the diff the first time and caches the result).

    def get_diff_result(self):
        if self._diff_result is None:
            self._diff_result = calculate_expensive_diff(self)
        return self._diff_result
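
(Fleshed out a little, an illustrative sketch of the lazy-evaluation variant; the class layout and attribute names are assumptions:)

import difflib

class JobState:
    def __init__(self, old_data, new_data):
        self.old_data = old_data
        self.new_data = new_data
        self._diff_result = None  # computed on first access, then reused

    def get_diff_result(self):
        # Run the expensive diff once; every reporter reuses the result.
        if self._diff_result is None:
            self._diff_result = ''.join(difflib.unified_diff(
                self.old_data.splitlines(keepends=True),
                self.new_data.splitlines(keepends=True)))
        return self._diff_result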

@mborsetti
Contributor Author

Thanks! I was mentally stuck; it all makes sense.

Turning the function call into a cached global function did the trick:

CacheInfo(hits=0, misses=1, maxsize=1, currsize=1)
CacheInfo(hits=1, misses=1, maxsize=1, currsize=1)
CacheInfo(hits=2, misses=1, maxsize=1, currsize=1)
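
(These lines are successive outputs of the cache statistics that functools.lru_cache attaches to the decorated function, printed after each of the three calls; assuming the cached helper is named cached_unified_diff, as in the timing snippet further down:)

print(cached_unified_diff.cache_info())

One miss for the first call (body_text), then hits for body_html and the console reporter.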

Good to go!

@thp
Owner

thp commented Jul 11, 2020

Thanks for the update. Needs rebasing, as there are merge conflicts now. Have you measured and compared the performance with and without the cache?

@mborsetti
Contributor Author

Have now, and concluded once again that memoization works!

Here you go:

  1. I artificially created a diff, simulating a new entry being added to a corpus I am tracking, by modifying the keys of a job:

Before:

name: 2020 in science
url: https://en.wikipedia.org/wiki/2020_in_science
filter:
- html2text:
    method: pyhtml2text
    unicode_snob: true
    body_width: 0
    single_line_break: true
    ignore_images: true
- grepi: 10 July – Astronomers announce the discovery of the
- grepi: '892. /*/*'
- grepi: '893. /*/*'
- grepi: '894. /*/*'

After:

name: 2020 in science
url: https://en.wikipedia.org/wiki/2020_in_science
filter:
- html2text:
    method: pyhtml2text
    unicode_snob: true
    body_width: 0
    single_line_break: true
    ignore_images: true
  2. I enabled the stdout and email reporters with html: true.

  3. I timed the call to cached_unified_diff with the following code:

from timeit import default_timer

start = default_timer()
txt = cached_unified_diff(job_state.old_data, job_state.new_data, timestamp_old, timestamp_new, job_state.job.comparison_filter)
print(f'unified_diff duration {(default_timer() - start) * 1e3:.2f} milliseconds')

The results are:

Without caching (i.e. with the @functools.lru_cache(maxsize=1) decorator removed from the function; each number is the average of 3 runs):

unified_diff duration 6.88 milliseconds
unified_diff duration 6.16 milliseconds
unified_diff duration 5.93 milliseconds

With caching turned on (average of 3 runs):

unified_diff duration 7.02 milliseconds
unified_diff duration 0.00 milliseconds
unified_diff duration 0.00 milliseconds

There's some jitter in the numbers, so the 6.88 of the first set and the 7.02 of the second set are comparable: both are the initial, uncached computation.

TOTALS: in this one real-world example, caching reduces the total time from 19 ms to 7 ms. The improvement will of course be larger when additional reporters are enabled (e.g. slack in addition to email) or when more complex corpora and/or changes are being diffed.

P.S. Interestingly (to me), there also appears to be some built-in memoization in Python: even with caching turned off, calling the same function with the same parameters again reduces its execution time, as seen in the first set of numbers.

@mborsetti
Contributor Author

Rebased.

This is the last work I'll do on this improvement (other than forking the project if needed).

@thp
Owner

thp commented Jul 15, 2020

Closing in favor of #527, let's continue the discussion there.

@thp thp closed this Jul 15, 2020
@mborsetti mborsetti deleted the cache_unified_diff branch July 19, 2020 10:37