Refactor duplicates queries and report processing #52
I was running into an issue similar to the one described in #42 -- running Brunnhilde on a collection with several tens of thousands of files appeared to freeze for several hours at the "Siegfried scan complete. Processing results." step, with no CSV reports written, an empty report.html, and a still-open siegfried.sqlite.
I opened a copy of siegfried.sqlite and ran each of the queries found throughout brunnhilde.py to see where things might be getting hung up. The queries to find and count duplicates took several hours each -- I'm not exactly sure what was going on, but I think the query was conducting a separate full table scan for each hash, which isn't noticeable on smaller collections but causes query times to grow roughly quadratically with the number of files in a collection.
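To illustrate the kind of pattern I mean, here's a minimal sketch of a correlated subquery that forces SQLite to re-scan the table for every row (the table and column names are placeholders here, not necessarily the exact ones in brunnhilde.py):

```python
import sqlite3

conn = sqlite3.connect('siegfried.sqlite')  # hypothetical path
cursor = conn.cursor()

# Correlated subquery: for each row in the outer scan, SQLite
# re-evaluates the inner COUNT(*) with a fresh pass over the table,
# so total work grows roughly with rows * rows.
cursor.execute("""
    SELECT filename, filesize, md5 FROM siegfried t1
    WHERE (SELECT COUNT(*) FROM siegfried t2
           WHERE t2.md5 = t1.md5) > 1
    ORDER BY md5;
""")
duplicate_files = cursor.fetchall()
conn.close()
```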
I refactored the duplicates queries so that a single full table scan is checked against the results of one subquery that collects the hashes appearing in the database more than once. I tested the new queries on a few examples, and as far as I can tell they return the same results as the old queries in a fraction of the time.
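A sketch of the refactored shape, with the same placeholder names as above:

```python
import sqlite3

conn = sqlite3.connect('siegfried.sqlite')  # hypothetical path
cursor = conn.cursor()

# One GROUP BY subquery finds every hash that occurs more than once;
# the outer query then makes a single pass over the table and keeps
# rows whose hash is in that set, instead of re-counting per row.
cursor.execute("""
    SELECT filename, filesize, md5 FROM siegfried
    WHERE md5 IN (
        SELECT md5 FROM siegfried
        GROUP BY md5
        HAVING COUNT(md5) > 1
    )
    ORDER BY md5;
""")
duplicate_files = cursor.fetchall()
conn.close()
```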
I also reworked the duplicates.csv processing in the HTML report generation step to read the CSV only once -- the performance gains are not as dramatic as those from the SQL query changes, but they still make a noticeable difference for collections with a very large number of duplicate files. As with the SQL queries, I ran this against a few test cases and the "Duplicates" section of the HTML report looked the same as before.
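The single-pass idea looks roughly like this (again a sketch with assumed file and column names, not the exact code in the PR): group all rows by hash in one read, then render each group.

```python
import csv
from collections import defaultdict

# Read duplicates.csv once, bucketing rows by their hash so the
# "Duplicates" section of report.html can be generated without
# re-reading the file for each hash.
groups = defaultdict(list)
with open('duplicates.csv', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        groups[row['md5']].append(row)  # 'md5' column name assumed

for checksum, rows in groups.items():
    # ... emit one duplicates table per checksum into report.html ...
    pass
```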
Let me know if these changes look okay to you or if there's anything different you'd like to see in this PR, and thanks as always for creating and maintaining such an excellent tool.