
Cache unified_diff function #506

Closed
wants to merge 5 commits into from

Conversation

mborsetti
Contributor

When sending HTML email, the unified_diff function is called twice: once to generate body_text, and again to generate body_html, since the output of body_text is not reused for body_html. Furthermore, if output to the console is enabled, the function is called a third time.

While refactoring the code to avoid this waste would be ideal, this one-liner should speed things up by itself.
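
For illustration, a minimal sketch of the idea, wrapping the diff helper in functools.lru_cache so that repeated calls with identical inputs reuse one result (the helper name and signature here are assumptions for this example, not the exact diff in this PR):

import difflib
import functools

@functools.lru_cache(maxsize=1)  # the proposed one-liner: memoize the latest diff
def unified_diff(old_data, new_data):
    # difflib only runs on a cache miss; body_text, body_html and the
    # console output can then share the same computed result.
    return ''.join(difflib.unified_diff(old_data.splitlines(keepends=True),
                                        new_data.splitlines(keepends=True)))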

@thp thp left a comment
Owner


Will this do the right thing? Is the parameter hashable so that lru_cache does the right thing? Is there a performance issue with unified_diff() that requires us to cache the data?

@mborsetti
Contributor Author

Of course it isn't hashable, so it's doing nothing!

Yes, difflib.unified_diff() can be extremely costly (on the order of several seconds) on huge corpora, and everything helps.

Building in-code caching of the diff results is the only solution; any suggestions on how to architect it (there is so much nesting...)?
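
(A quick standalone illustration of the hashability point, not from the PR: functools.lru_cache keys its cache on the call arguments, so hashable arguments get cached, while an unhashable argument raises TypeError at call time.)

import functools

@functools.lru_cache(maxsize=1)
def diff_length(old, new):
    print('computing...')  # printed only on a cache miss
    return abs(len(old) - len(new))

diff_length('a', 'bbb')   # prints 'computing...'; result is cached
diff_length('a', 'bbb')   # cache hit: nothing printed
try:
    diff_length(['a'], ['bbb'])  # lists are unhashable
except TypeError as e:
    print(e)              # unhashable type: 'list'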

@thp
Owner

thp commented Jul 10, 2020

Did you compare the performance of difflib.unified_diff() with using e.g. an external diff tool (diff -u)?
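
(For reference, such a comparison could shell out to an external tool roughly like this; an illustrative sketch only. GNU diff exits with status 1 when the inputs differ, so the return code must not be treated as an error, and since difflib works on in-memory strings, the data would first have to be written to temporary files.)

import subprocess

def external_unified_diff(old_path, new_path):
    # GNU diff returns exit status 1 when the files differ,
    # so don't raise on a nonzero return code here.
    result = subprocess.run(['diff', '-u', old_path, new_path],
                            capture_output=True, text=True)
    return result.stdout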

If difflib.unified_diff() is the problem, then you could move this block into its own global function:

        return ''.join(difflib.unified_diff(job_state.old_data.splitlines(keepends=True),
                                            job_state.new_data.splitlines(keepends=True),
                                            '@', '@', timestamp_old, timestamp_new))

Something like this:

import difflib
import functools

@functools.lru_cache(maxsize=1)
def _unified_diff(old_data, new_data, timestamp_old, timestamp_new):
    return ''.join(difflib.unified_diff(old_data.splitlines(keepends=True),
                                        new_data.splitlines(keepends=True),
                                        '@', '@', timestamp_old, timestamp_new))

Of course, you could also encapsulate this into a DiffResult object that gets created based on the job_state, and pass this diff result around to every code part that needs it (or implement lazy evaluation as part of the job, i.e. a job.get_diff_result() that calculates the diff the first time and caches the result).

    def get_diff_result(self):
        if self._diff_result is None:
            self._diff_result = calculate_expensive_diff(self)
        return self._diff_result
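
(Fleshed out a little, an illustrative sketch of the lazy-evaluation variant; the class layout and attribute names are assumptions:)

import difflib

class JobState:
    def __init__(self, old_data, new_data):
        self.old_data = old_data
        self.new_data = new_data
        self._diff_result = None  # computed on first access, then reused

    def get_diff_result(self):
        # Run the expensive diff once; every reporter reuses the result.
        if self._diff_result is None:
            self._diff_result = ''.join(difflib.unified_diff(
                self.old_data.splitlines(keepends=True),
                self.new_data.splitlines(keepends=True)))
        return self._diff_result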

@mborsetti
Contributor Author

Thanks! I was mentally stuck; it all makes sense.

Turning the function call into a cached global function did the trick:

CacheInfo(hits=0, misses=1, maxsize=1, currsize=1)
CacheInfo(hits=1, misses=1, maxsize=1, currsize=1)
CacheInfo(hits=2, misses=1, maxsize=1, currsize=1)
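
(These lines are successive outputs of the cache statistics that functools.lru_cache attaches to the decorated function, printed after each of the three calls; assuming the cached helper is named cached_unified_diff, as in the timing snippet further down:)

print(cached_unified_diff.cache_info())

One miss for the first call (body_text), then hits for body_html and the console reporter.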

Good to go!

@thp
Owner

thp commented Jul 11, 2020

Thanks for the update. Needs rebasing, as there are merge conflicts now. Have you measured and compared the performance with and without the cache?

@mborsetti
Contributor Author

Have now, and concluded once again that memoization works!

Here you go:

  1. I artificially created a diff, simulating a new entry being added to a corpus I am tracking, by modifying the keys of a job:

Before:

name: 2020 in science
url: https://en.wikipedia.org/wiki/2020_in_science
filter:
- html2text:
    method: pyhtml2text
    unicode_snob: true
    body_width: 0
    single_line_break: true
    ignore_images: true
- grepi: 10 July – Astronomers announce the discovery of the
- grepi: '892. /*/*'
- grepi: '893. /*/*'
- grepi: '894. /*/*'

After:

name: 2020 in science
url: https://en.wikipedia.org/wiki/2020_in_science
filter:
- html2text:
    method: pyhtml2text
    unicode_snob: true
    body_width: 0
    single_line_break: true
    ignore_images: true
  2. I enabled the stdout and email reporters with html: true.

  3. I timed the call to cached_unified_diff with the following code:

from timeit import default_timer

start = default_timer()
txt = cached_unified_diff(job_state.old_data, job_state.new_data, timestamp_old, timestamp_new, job_state.job.comparison_filter)
print(f'unified_diff duration {(default_timer() - start) * 1e3:.2f} milliseconds')

The results are:

Without caching (i.e. with the @functools.lru_cache(maxsize=1) decorator removed from the function; each number is the average of 3 runs):

unified_diff duration 6.88 milliseconds
unified_diff duration 6.16 milliseconds
unified_diff duration 5.93 milliseconds

With caching turned on (average of 3 runs):

unified_diff duration 7.02 milliseconds
unified_diff duration 0.00 milliseconds
unified_diff duration 0.00 milliseconds

There's some jitter in the numbers, so the 6.88 of the first set and the 7.02 of the second set are comparable: both are the initial, uncached computation.

TOTALS: in this one real-world example, caching reduces the total time from 19 ms to 7 ms. The improvement will of course be larger when additional reporters are enabled (e.g. slack in addition to email) or when more complex corpora and/or changes are being diffed.

P.S. Interestingly (to me), there also appears to be some built-in memoization in Python: even with caching turned off, calling the same function with the same parameters again reduces its execution time, as seen in the first set of numbers.

@mborsetti
Contributor Author

Rebased.

This is the last work I'll do on this improvement (other than forking the project if needed).

@thp
Owner

thp commented Jul 15, 2020

Closing in favor of #527, let's continue the discussion there.

@thp thp closed this Jul 15, 2020
@mborsetti mborsetti deleted the cache_unified_diff branch July 19, 2020 10:37