Refactor duplicates queries and report processing #52
I was running into an issue similar to the one described in #42 -- running Brunnhilde on a collection with several tens of thousands of files appeared to freeze for several hours at the "Siegfried scan complete. Processing results." step, with no CSV reports written, an empty report.html, and a still-open siegfried.sqlite.
I opened a copy of siegfried.sqlite and ran each of the queries found throughout brunnhilde.py to see where things might be getting hung up. The queries to find and count duplicates took several hours each -- I'm not exactly sure what was going on, but I think the query was conducting a separate full table scan for each hash, which isn't noticeable on smaller collections but causes query times to grow roughly quadratically with the number of files in a collection.
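To illustrate the kind of pattern I mean, here's a minimal sketch of a correlated subquery that forces SQLite to re-scan the table for every row (the table and column names are placeholders here, not necessarily the exact ones in brunnhilde.py):

```python
import sqlite3

conn = sqlite3.connect('siegfried.sqlite')  # hypothetical path
cursor = conn.cursor()

# Correlated subquery: for each row in the outer scan, SQLite
# re-evaluates the inner COUNT(*) with a fresh pass over the table,
# so total work grows roughly with rows * rows.
cursor.execute("""
    SELECT filename, filesize, md5 FROM siegfried t1
    WHERE (SELECT COUNT(*) FROM siegfried t2
           WHERE t2.md5 = t1.md5) > 1
    ORDER BY md5;
""")
duplicate_files = cursor.fetchall()
conn.close()
```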
I refactored the duplicates queries so that a single full table scan is checked against the results of one subquery that collects the hashes appearing in the database more than once. I tested the new queries on a few examples, and as far as I can tell they return the same results as the old queries in a fraction of the time.
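A sketch of the refactored shape, with the same placeholder names as above:

```python
import sqlite3

conn = sqlite3.connect('siegfried.sqlite')  # hypothetical path
cursor = conn.cursor()

# One GROUP BY subquery finds every hash that occurs more than once;
# the outer query then makes a single pass over the table and keeps
# rows whose hash is in that set, instead of re-counting per row.
cursor.execute("""
    SELECT filename, filesize, md5 FROM siegfried
    WHERE md5 IN (
        SELECT md5 FROM siegfried
        GROUP BY md5
        HAVING COUNT(md5) > 1
    )
    ORDER BY md5;
""")
duplicate_files = cursor.fetchall()
conn.close()
```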
I also reworked the duplicates.csv processing in the HTML report generation step to read the CSV only once -- the performance gains are not as dramatic as those from the SQL query changes, but they still make a noticeable difference for collections with a very large number of duplicate files. As with the SQL queries, I ran this against a few test cases and the "Duplicates" section of the HTML report looked the same as before.
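The single-pass idea looks roughly like this (again a sketch with assumed file and column names, not the exact code in the PR): group all rows by hash in one read, then render each group.

```python
import csv
from collections import defaultdict

# Read duplicates.csv once, bucketing rows by their hash so the
# "Duplicates" section of report.html can be generated without
# re-reading the file for each hash.
groups = defaultdict(list)
with open('duplicates.csv', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        groups[row['md5']].append(row)  # 'md5' column name assumed

for checksum, rows in groups.items():
    # ... emit one duplicates table per checksum into report.html ...
    pass
```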
Let me know if these changes look okay to you or if there's anything different you'd like to see in this PR, and thanks as always for creating and maintaining such an excellent tool.