
Refactor duplicates queries and report processing #52

Merged
merged 2 commits into tw4l:develop on Jan 14, 2022

Conversation

djpillen
Contributor

I was running into an issue similar to the one described in #42 -- running Brunnhilde on a collection with several tens of thousands of files appeared to freeze for several hours at the "Siegfried scan complete. Processing results." step, with no CSV reports written, an empty report.html, and an open siegfried.sqlite.

I opened a copy of siegfried.sqlite and ran each of the queries found throughout brunnhilde.py to see where things might be getting hung up. The queries to find and count duplicates took several hours each -- I'm not exactly sure what was going on, but I think the query was conducting a separate full table scan for each hash. That isn't noticeable on smaller collections, but it means query time grows roughly quadratically as the number of files in a collection grows.
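To illustrate the shape of the problem (a hypothetical sketch with assumed table and column names, not the exact query in brunnhilde.py), a per-hash correlated subquery like the one below forces SQLite to re-scan the whole table once per row when there is no index on the hash column:

```python
import sqlite3

# Hypothetical illustration of the slow pattern described above. The table
# name "siegfried" and the column names are assumptions for this sketch.
# The correlated subquery is re-evaluated for every row of t1, so the work
# grows roughly quadratically with the number of files in the collection.
conn = sqlite3.connect("siegfried.sqlite")
cursor = conn.cursor()
slow_duplicates = cursor.execute(
    """
    SELECT t1.filename, t1.filesize, t1.hash
    FROM siegfried t1
    WHERE (SELECT COUNT(*) FROM siegfried t2 WHERE t2.hash = t1.hash) > 1
    ORDER BY t1.hash
    """
).fetchall()
conn.close()
```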

I refactored the duplicates queries to instead make a single pass over the table, checking each row against the results of one subquery that finds hashes appearing in the database more than once. I tested the new queries on a few examples and, as far as I can tell, they return the same results as the old queries in a much shorter time.
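A minimal sketch of the refactored approach, again with assumed table and column names: the `GROUP BY ... HAVING` subquery that finds repeated hashes runs once, and the outer query makes a single pass over the table against that result set.

```python
import sqlite3

# Sketch of the refactored duplicates query described above. Table and
# column names are assumptions for illustration, not necessarily the ones
# used in brunnhilde.py.
conn = sqlite3.connect("siegfried.sqlite")
cursor = conn.cursor()
duplicates = cursor.execute(
    """
    SELECT filename, filesize, hash
    FROM siegfried
    WHERE hash IN (
        SELECT hash
        FROM siegfried
        GROUP BY hash
        HAVING COUNT(*) > 1
    )
    ORDER BY hash
    """
).fetchall()
conn.close()
```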

I also reworked the duplicates.csv processing in the HTML report generation step so the CSV is only read once. The performance improvement is not as dramatic as the one from the SQL query changes, but it still makes a difference for collections with a very large number of duplicate files. As with the SQL queries, I ran this against a few test cases and the "Duplicates" section of the HTML report looked the same as before.
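A rough sketch of the single-pass CSV grouping, assuming a header row and a hash column (the actual duplicates.csv layout in Brunnhilde may differ):

```python
import csv
from collections import defaultdict

# Read duplicates.csv a single time and group rows by hash for the HTML
# report, rather than re-reading the file for each hash. The "hash" column
# name is an assumption for this sketch.
groups = defaultdict(list)
with open("duplicates.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        groups[row["hash"]].append(row)

# Each value in `groups` now holds the full set of files sharing one
# checksum, ready to be written out as a "Duplicates" section in report.html.
```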

Let me know if these changes look okay to you or if there's anything different you'd like to see in this PR, and thanks as always for creating and maintaining such an excellent tool.

@tw4l
Owner

tw4l commented Sep 2, 2021

Hi @djpillen, thank you for this pull request! From a quick glance this seems like a solid improvement. I should have time in the next few days to review more properly, and I'd be happy to get this into a point release fairly quickly.

@kieranjol
Contributor

Hi @djpillen - I'd like to be able to replicate and test this. Apart from generating a bunch of files, do they need to be a particular minimum size?

@tw4l tw4l changed the base branch from main to develop January 14, 2022 04:41
@tw4l tw4l merged commit c53fb03 into tw4l:develop Jan 14, 2022
@tw4l
Owner

tw4l commented Jan 14, 2022

Thank you @djpillen - sorry it took me a while, but I'm happy to merge this and cut a patch release!

tw4l pushed a commit that referenced this pull request Jan 14, 2022
* refactor duplicates sql queries to improve performance
* refactor duplicates.csv report parsing