This repository curates multiple open-source vulnerability corpora and provides tooling to normalize, deduplicate, and analyze them in a consistent format. The primary datasets include CrossVul, JaConTeBe, MegaVul, MSR, PrimeVul, SVEN, and several auxiliary benchmarks. Normalized exports power downstream research, while signature and statistics reports support quality checks and curation.
- Canonical CSV exports produced from heterogeneous dataset formats.
- Deterministic per-sample signatures for duplicate detection (see the sketch after this list).
- CWE coverage statistics and category roll-ups derived from `collect.json`.
- Single-entry CLI (`python main.py`) that orchestrates the full pipeline.
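The actual signature scheme lives in `src/signature.py`; as an illustration of the idea only (the real canonicalization may strip comments or normalize tokens differently), a deterministic content signature can be built by canonicalizing a code sample and hashing the result:

```python
import hashlib

def code_signature(code: str) -> str:
    """Illustrative content signature: canonicalize whitespace, then hash."""
    # Collapsing all whitespace runs makes formatting-only variants hash
    # identically, so the signature is deterministic per sample.
    canonical = " ".join(code.split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Two samples differing only in layout yield the same signature.
assert code_signature("int main() { return 0; }") == \
       code_signature("int main()\n{\n  return 0;\n}")
```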
- Python 3.12+ (managed with `uv venv` in this repository).
- Dataset-specific dependencies are kept minimal; optional `pyarrow` enables Juliet statistics (see the import-guard sketch below).
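Optional dependencies like this are commonly gated with an import guard; the snippet below shows the generic Python pattern only (the helper name `juliet_stats_available` is hypothetical, not this repository's API):

```python
# Generic optional-dependency guard; the repository's actual check may differ.
try:
    import pyarrow  # noqa: F401  # enables the Parquet-backed Juliet statistics
    HAVE_PYARROW = True
except ImportError:
    HAVE_PYARROW = False

def juliet_stats_available() -> bool:
    """Report whether the optional Juliet statistics path can run."""
    return HAVE_PYARROW
```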
- Activate the virtual environment (once created via `uv venv`): `source .venv/bin/activate`
- Normalize, deduplicate, and compute stats for all datasets: `python main.py --dataset all`
  Add `--force-normalize` or `--force-signatures` to rebuild existing outputs.
- Inspect generated artifacts (see the loading sketch after this list):
  - Normalized CSVs in `standardized/`.
  - Signature manifests in `signatures/`.
  - CWE counts in `cwe_counts.json`.
  - Category summaries in `category_summary_level*.csv`.
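As a quick sanity check, the CWE counts can be inspected from Python. This is a minimal sketch that assumes `cwe_counts.json` maps CWE identifiers to integer counts; verify the actual schema before relying on it:

```python
import json

# Assumed schema: {"CWE-79": 1234, "CWE-89": 987, ...}
with open("cwe_counts.json", encoding="utf-8") as fh:
    counts = json.load(fh)

# Print the ten most frequent CWEs.
for cwe, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{cwe}: {n}")
```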
```
python main.py [options]

--dataset NAME        Target dataset(s); repeatable. Use "all" for every dataset.
--limit N             Cap the number of rows emitted during normalization.
--signature-dir DIR   Destination for signature CSVs (default: signatures/).
--force-normalize     Rebuild normalized CSVs even if they already exist.
--force-signatures    Rebuild signature CSVs even if they already exist.
--verbose             Enable DEBUG logging for troubleshooting.
```
The toolkit implements a multi-stage pipeline that transforms raw vulnerability datasets into clean, deduplicated benchmarks. See docs/PIPELINE.md for the complete pipeline documentation.
Pipeline Stages:
- Standardization (`scripts/normalize_datasets.py`) - Convert datasets to unified CSV format
- Signature Generation (`scripts/signatures.py`) - Create content hashes for deduplication
- Deduplication (`scripts/clean_duplicates.py`) - Remove duplicate entries (see the sketch after this list)
- Benchmark Creation (`scripts/create_benchmark.py`) - Merge into unified JSON
- Filtering (`scripts/filter_benchmark.py`) - Select representative samples
- Clustering (`scripts/cluster_benchmark.py`) - ML-based sample selection
- Analysis (`scripts/analyze_cwe.py`) - Statistical analysis and reporting
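Conceptually, the deduplication stage keeps the first row seen per signature and drops the rest. A minimal sketch of that idea, not the actual logic of `scripts/clean_duplicates.py` (the `signature` column name is an assumption):

```python
import csv

def dedupe_rows(path: str, out_path: str, key: str = "signature") -> None:
    """Keep the first occurrence of each signature; drop later duplicates."""
    seen: set[str] = set()
    with open(path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row[key] not in seen:
                seen.add(row[key])
                writer.writerow(row)
```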
- `scripts/normalize_datasets.py`: Dataset-specific normalizers (invoked via `main.py`)
- `scripts/signatures.py`: Standalone signature generation
- `scripts/clean_duplicates.py`: Cross-dataset deduplication
- `scripts/create_benchmark.py`: Unified benchmark creation
- `scripts/filter_benchmark.py`: Stratified sample selection (10 per CWE)
- `scripts/cluster_benchmark.py`: Embedding-based clustering and selection
- `scripts/analyze_cwe.py`: Unified CWE analysis tool (counting mode sketched below)
  - Simple counting mode: `python scripts/analyze_cwe.py input.jsonl`
  - Detailed analysis: `python scripts/analyze_cwe.py input.jsonl --detailed`
  - CSV export: `python scripts/analyze_cwe.py input.jsonl -o stats.csv`
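Conceptually, the counting mode tallies one CWE label per JSONL record. A self-contained sketch of that idea, assuming each line carries a `cwe` field (the real key may differ):

```python
import json
from collections import Counter

def count_cwes(path: str) -> Counter:
    """Tally CWE labels across a JSONL file, one JSON object per line."""
    counts: Counter = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            counts[record.get("cwe", "UNKNOWN")] += 1  # assumed field name
    return counts

print(count_cwes("input.jsonl").most_common(10))
```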
- `scripts/cwe_stats.py`: Comprehensive CWE statistics
- `scripts/category_summary.py`: Category-level aggregation
- `scripts/analyze_cwe_stats.py`: Advanced CWE analytics
- `scripts/json_to_jsonl.py`: JSON to JSONL conversion (see the sketch below)
- `scripts/count_cwe.py`: Simple CWE counting (superseded by `analyze_cwe.py`)
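JSON-to-JSONL conversion just flattens a top-level array into one object per line; a minimal sketch of the idea, which likely omits edge cases the real script handles:

```python
import json

def json_to_jsonl(src: str, dst: str) -> None:
    """Write each element of a top-level JSON array as one line of JSONL."""
    with open(src, encoding="utf-8") as fh:
        records = json.load(fh)  # assumes the file holds a JSON array
    with open(dst, "w", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
```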
- `crossvul/`, `JaConTeBe/`, `megavul/`, etc.: raw dataset sources.
- `src/dataset/`: normalization modules for each corpus.
- `src/signature.py`: code canonicalization and hashing utilities.
- `src/utils/`: shared utility modules for JSON/CSV I/O, CWE processing, logging, etc.
- `standardized/`: canonicalized CSV exports.
- `signatures/`: per-row signature manifests.
- `clean/`: deduplicated standardized data and signatures.
- `scripts/`: command-line helpers and analytics.
- `docs/`: documentation, including the complete pipeline guide.
Follow the dataset naming and metadata conventions outlined in `user_instructions`. When adding new samples:
- Update the corresponding dataset manifest (e.g., `crossvul/metadata.json`).
- Regenerate the affected normalized CSV and signature files via `main.py`.
- Refresh statistics and summaries to keep downstream analyses in sync.
This repository aggregates datasets with their respective upstream licenses. Refer to each dataset directory for attribution details.