Identify exact and near-duplicate files (text / source) within a directory tree.
- Shingling (k-token) with configurable size
- Hashed shingles + Jaccard similarity
- Parallel signature scan (`--workers`) for larger corpora
- MinHash + LSH prefilter (`--prefilter`) to prune candidate pairs (scales better)
- Cluster output mode (`--clusters`) groups interconnected duplicates
- CLI JSON or table output; schema versioned and documented
- Comprehensive test framework: unit, integration, property, performance tests
- CI via GitHub Actions (multi-version Python)
- Extensible: plug in tokenizers, ignore patterns (planned), semantic strategies
```shell
pip install -e ".[dev]"
```
Requires Python >=3.9.
Basic scan:

```shell
duplicate-finder scan ./repo --threshold 0.85 --ext .py,.md,.txt --k 5
```

Parallel scan (6 workers):

```shell
duplicate-finder scan ./repo --workers 6
```

MinHash + LSH prefilter (recommended for >1k files):

```shell
duplicate-finder scan ./big --prefilter --minhash-perms 64 --lsh-bands 16
```

Cluster output (table):

```shell
duplicate-finder scan ./repo --clusters
```

Cluster JSON:

```shell
duplicate-finder scan ./repo --clusters --json
```
Run full suite:

```shell
pytest
```

Run only integration tests:

```shell
pytest tests/integration
```

Skip slow tests:

```shell
pytest -m "not slow"
```

Run slow performance tests:

```shell
pytest -m slow -v
```

Run property-based tests (Hypothesis):

```shell
pytest -m property
```
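Custom markers like `slow` and `property` generally need to be registered, or pytest warns about unknown marks. A minimal sketch, assuming registration lives in `pytest.ini` (the repo may use `pyproject.toml` or `setup.cfg` instead):

```ini
[pytest]
markers =
    slow: long-running performance tests
    property: Hypothesis property-based tests
```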
GitHub Actions runs tests on Python 3.9-3.12 for every push/PR. Slow tests run only on main branch pushes.
See docs/json-schema.md for complete schema documentation and versioning policy.
Validate output with schema/duplicates.schema.json (JSON Schema draft-07).
- Normalize whitespace.
- Tokenize via regex `[A-Za-z0-9_]+`.
- Build k-token shingles; hash each shingle with MD5.
- Optional MinHash signature + LSH banding to pick candidate pairs.
- Jaccard similarity on hashed shingle sets for scoring.
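The non-prefilter part of the pipeline fits in a few lines. A minimal illustration of hashed k-token shingles and Jaccard scoring (a sketch, not the package's actual implementation):

```python
import hashlib
import re

TOKEN_RE = re.compile(r"[A-Za-z0-9_]+")

def shingle_hashes(text: str, k: int = 5) -> set:
    """Tokenize, build k-token shingles, and hash each shingle with MD5."""
    tokens = TOKEN_RE.findall(text)
    return {
        hashlib.md5(" ".join(tokens[i:i + k]).encode()).hexdigest()
        for i in range(len(tokens) - k + 1)
    }

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |intersection| / |union| of hashed shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Identical files share every shingle and score 1.0; unrelated files share few or none and score near 0.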
- `--prefilter` builds MinHash signatures (`--minhash-perms`) and buckets them into bands (`--lsh-bands`).
- Reduces the number of pairwise comparisons; with suitable parameters, results match the exhaustive scan with high probability.
- For small datasets (<50 files), the prefilter is skipped automatically.
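The prefilter idea can be sketched as follows; `minhash_signature` and `lsh_buckets` are illustrative names, not the package's API. Files whose signatures agree on any band land in the same bucket and become candidate pairs:

```python
import random

def minhash_signature(shingles: set, num_perms: int = 64, seed: int = 1) -> list:
    """MinHash: for each of num_perms hash functions, keep the minimum hash value.
    shingles must be a non-empty set of ints or hex strings."""
    rng = random.Random(seed)
    # Simple linear hash family (a*x + b) over 64-bit integers.
    params = [(rng.getrandbits(64) | 1, rng.getrandbits(64)) for _ in range(num_perms)]
    values = [int(s, 16) if isinstance(s, str) else s for s in shingles]
    return [min(((a * v + b) & (2**64 - 1)) for v in values) for a, b in params]

def lsh_buckets(signatures: dict, bands: int = 16) -> dict:
    """Split each signature into bands; files sharing any band key are candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for name, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets.setdefault(key, set()).add(name)
    return buckets
```

With 64 permutations and 16 bands, each band covers 4 signature rows; more bands catch lower-similarity pairs at the cost of more candidates.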
Duplicate pairs are converted into connected components. The representative file is chosen lexicographically; cluster size and maximum intra-pair similarity are reported.
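The pair-to-cluster step can be sketched with union-find; this illustrates the approach described above, not the package's `cluster.py`:

```python
from collections import defaultdict

def clusters_from_pairs(pairs):
    """Group (file_a, file_b, similarity) pairs into connected components.
    Returns one dict per cluster: lexicographic representative, size,
    max intra-pair similarity, and the sorted member list."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:  # path compression
            parent[x], x = root, parent[x]
        return root

    for a, b, _ in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    members = defaultdict(set)
    max_sim = defaultdict(float)
    for a, b, sim in pairs:
        root = find(a)
        members[root].update((a, b))
        max_sim[root] = max(max_sim[root], sim)

    return [
        {"representative": min(files), "size": len(files),
         "max_similarity": max_sim[root], "files": sorted(files)}
        for root, files in members.items()
    ]
```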
- Pair mode: similarity, file paths, token counts.
- Cluster mode: cluster id, size, representative, max similarity.
- JSON includes `schema_version` for downstream stability.
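For illustration only, a pair-mode JSON payload might look like the following; the field names and `schema_version` value here are assumptions based on the list above, and the authoritative definition lives in docs/json-schema.md and schema/duplicates.schema.json:

```python
import json

# Hypothetical pair-mode payload shape (field names are illustrative).
payload = {
    "schema_version": "1",
    "pairs": [
        {
            "file_a": "src/a.py",
            "file_b": "src/b.py",
            "similarity": 0.91,
            "tokens_a": 120,
            "tokens_b": 118,
        }
    ],
}
print(json.dumps(payload, indent=2))
```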
Synthetic generation:

```shell
python benchmarks/run_benchmarks.py --files 800 --dup-groups 80 --group-size 5 --workers 6 --verbose
```

Profiling (serial vs parallel vs prefilter, with memory):

```shell
python benchmarks/run_profile.py ./repo --parallel-workers 6 --repeat 3
```
Artifacts written: benchmarks/last_profile.md, benchmarks/last_profile.json.
```
src/duplicate_finder/
    __init__.py
    core.py
    minhash.py
    cluster.py
    index.py
    cli.py
benchmarks/
    run_benchmarks.py
    run_profile.py
    README.md
tests/
    conftest.py
    unit/
    integration/
    performance/
    test_*.py
docs/
    json-schema.md
schema/
    duplicates.schema.json
.github/workflows/
    ci.yml
LICENSE
```
- Ignore patterns / region filtering
- Parallel pairwise comparison
- MinHash parameter tuning
- Semantic duplicate detection (embeddings)
Open issues focused on a single feature or performance improvement. Include benchmark deltas when relevant. PRs run the full test suite via CI.
MIT License - see LICENSE file for details.