A pipeline that quantifies how BLAST's detection of hazardous DNA sequences degrades as adversaries fragment and mutate the input. Built for the AIxBio hackathon to inform the policy debate around minimum DNA synthesis screening thresholds.
DNA synthesis providers screen orders against curated lists of hazardous sequences (toxin genes, pathogen markers, etc.). The 2024 OSTP Framework currently requires screening at the 200 bp level, with the threshold dropping to 50 bp effective October 2026. We measure how the upcoming 50 bp threshold holds up when an adversary:
- Fragments the hazardous gene into small pieces (varying T from 20 to 200 bp)
- Mutates the fragments to evade exact-match alignment (varying rate from 0 to 20%)
- Hides hazardous fragments inside benign filler (dilute evasion vs pure evasion)
The output is a per-fragment sensitivity profile and a per-order ROC under "any-flag" aggregation, summarized into four figures, plus a robustness check under a more rigorous data configuration.
Per-fragment sensitivity: at the upcoming 50 bp threshold and standard biosecurity cutoff (e ≤ 1e-9), BLAST detects 100% of unmutated hazardous fragments but only 16% of fragments with 20% adversarial mutation.
Dilute attack failure mode: at the same 50 bp threshold and 20% mutation, BLAST catches only 60% of dilute attacks at 1% FPR (one hazardous fragment hidden among benign filler). Pure attacks (orders entirely composed of hazardous fragments) remain caught reliably across nearly all configurations.
Robustness: under an expanded data configuration (10 diversified honest sources spanning bacterial, viral, eukaryotic, plant, and synthetic origins; 40-record hazardous database under a 50 kb cap), the qualitative findings are preserved and the quantitative gap widens (60% → 48% catch rate at the same operating point).
See RESULTS_MEMO.md for full interpretation and the writeup PDF for the published version.
pipeline/ # All scripts run from here
fetch_data.py # NCBI download of E. coli K-12 (primary honest source)
fetch_diverse_honest.py # NCBI download of 10 diverse sources (robustness honest source)
aggregate_hazardous.py # Merge .gb + .fasta files into one consolidated FASTA
build_corpus.py # Generate honest, evasion_pure, evasion_dilute corpora
# --mutation-rate, --hazardous, --out-dir flags
screen.py # BLAST a corpus, write per-fragment + per-order CSVs
evaluate.py # Sweep thresholds, compute ROC, plot all 4 figures
benchmark.py # Optional: time the pipeline at different corpus sizes
hazardous_PLACEHOLDER.fasta # Tiny synthetic placeholder for dry-run testing
RUNBOOK.md # Step-by-step run instructions (Windows / cmd)
data/ # All gitignored
hazardous_data/ # YOU PROVIDE: folder of .gb / .fasta files
hazardous.fasta # GENERATED by aggregate_hazardous.py + cap filter
hazardous_manifest.tsv # GENERATED — what records came from where
honest_source.fasta # GENERATED by fetch_data.py or fetch_diverse_honest.py
honest_sources/ # GENERATED — per-source cache for fetch_diverse_honest.py
honest_manifest.tsv # GENERATED — per-source audit trail
corpora/ # GENERATED by build_corpus.py
# one subdir per (T, mutation rate)
results/
v1_5/ # Primary results (E. coli only, 19 records, 10 kb cap)
fig1_fragment_sensitivity.png
fig2_order_roc_panels.png
fig3_evasion_heatmap.png
fig4_operating_points.png
fragment_sensitivity.csv
roc_data.csv
summary.csv
v2_robustness/ # Robustness check (10 honest sources, 40 records, 50 kb cap)
(same file structure as v1_5/)
README.md # Explains the v1_5 vs v2_robustness split
README.md # This file
RESULTS_MEMO.md # Interpretive document — read before the figures
.gitignore
Prerequisites: Windows + Miniforge + a blast-env conda env with blast, biopython, pandas, matplotlib. See pipeline/RUNBOOK.md for one-time setup.
cd pipeline
REM 1. Get honest source genome (E. coli K-12 MG1655, ~5 MB)
python fetch_data.py
REM 2. Aggregate your hazardous .gb / .fasta folder into one FASTA
REM Requires data/hazardous_data/ folder pre-populated with your input files
python aggregate_hazardous.py
REM 3. Filter to records ≤ 10 kb (toxin-CDS scale; primary configuration)
python -c "from Bio import SeqIO; recs=[r for r in SeqIO.parse('../data/hazardous.fasta','fasta') if len(r.seq)<=10000]; SeqIO.write(recs, '../data/hazardous.fasta', 'fasta'); print(f'Kept {len(recs)} records')"
REM 4. Build the BLAST DB
makeblastdb -in ..\data\hazardous.fasta -dbtype nucl -out hazardous_db -title "Hazardous reference"
REM 5. Generate corpora at all (T, mutation) combinations (~5 min)
for %M in (0.0 0.01 0.05 0.10 0.15 0.20) do (
for %T in (20 30 50 75 100 150 200) do (
python build_corpus.py --T %T --n-orders 1000 --mutation-rate %M
)
)
REM 6. Screen all corpora (~30-45 min)
for %M in (0 1 5 10 15 20) do (
for %T in (20 30 50 75 100 150 200) do (
python screen.py --corpus ..\data\corpora\T%T_mut%M\honest.fasta --db hazardous_db
python screen.py --corpus ..\data\corpora\T%T_mut%M\evasion_pure.fasta --db hazardous_db
python screen.py --corpus ..\data\corpora\T%T_mut%M\evasion_dilute.fasta --db hazardous_db
)
)
REM 7. Produce all four figures and summary tables
python evaluate.pyTo reproduce the robustness check (Appendix C of the writeup), the only differences are:
- Honest source: 10 diversified sources (~100 Mb) instead of E. coli alone. Replace step 1 with
python fetch_diverse_honest.py(downloads cache todata/honest_sources/and consolidates todata/honest_source.fasta). - Hazardous cap: 50 kb instead of 10 kb. In step 3, change
len(r.seq)<=10000tolen(r.seq)<=50000.
All other steps are unchanged. After running evaluate.py, the figures will reflect the v2 configuration.
To preserve both result sets side by side, move outputs into results/v1_5/ or results/v2_robustness/ between runs (or update evaluate.py's output directory).
For dry-run / development with the placeholder hazardous data, copy pipeline/hazardous_PLACEHOLDER.fasta to data/hazardous.fasta and run from step 4.
A consolidated hazardous reference (primary: 19 records ≤ 10 kb; robustness: 40 records ≤ 50 kb) is built into a nucleotide BLAST DB. For each fragment-length T ∈ {20, 30, 50, 75, 100, 150, 200} bp and mutation rate M ∈ {0, 1, 5, 10, 15, 20}%, we generate three order corpora of 1000 orders each: honest (5–20 random T-bp slices from E. coli K-12 MG1655 in primary, or from a 10-source diversified corpus in robustness), evasion_pure (5–20 random T-bp slices from one hazardous parent, mutated at rate M), and evasion_dilute (one mutated hazardous fragment + 4–19 unmutated honest fillers). Mutation introduces independent per-base random substitutions to a different nucleotide, modeling an adversary who synthesizes a variant of the hazardous gene rather than the literal sequence. Each corpus is screened with blastn -task blastn-short -evalue 10 against the hazardous DB. Per-fragment results feed Figures 1 and 3; per-order results aggregate via "any-flag" (an order is flagged if any fragment hits below the screening threshold) and feed Figures 2, 3, and 4.
- Figure 1 (
fig1_fragment_sensitivity.png) — per-fragment sensitivity. Three rows of panels stratified by e-value cutoff (1e-3, 1e-9, 1e-30); left column shows hazardous TPR vs T, right column shows honest FPR. One curve per mutation rate (0–20%, viridis colormap). The scientific backup figure showing BLAST's underlying sensitivity profile. - Figure 2 (
fig2_order_roc_panels.png) — per-order ROC under any-flag aggregation. 7×6 grid: rows = T, cols = mutation rate. Pure (blue) and dilute (orange) curves overlaid. Dense; supplementary use only. - Figure 3 (
fig3_evasion_heatmap.png) — the headline figure. Two heatmaps side by side: pure evasion (left) and dilute evasion (right). T on Y, mutation rate on X, color = TPR at FPR=1%. Makes the "evasion gap" visible in one glance. - Figure 4 (
fig4_operating_points.png) — slide-friendly summary. Bar chart of catch rates at T=50 (the upcoming OSTP threshold) across three adversary types (naive 0%, realistic 10%, sophisticated 20%) and three FPR operating points.
- Honest source diversity: The robustness check (10 sources, ~100 Mb spanning bacterial, viral, eukaryotic, plant, and synthetic) confirms honest FPR remains ≤ 0.3% across all configurations. Whether this reflects database-bounded behavior or insufficient diversity at our scale is open — real customer orders span synthetic-bio constructs and poorly-annotated organism content we did not measure.
- Mutation model: uniform-random substitution. Real adversaries can do better with synonymous substitutions or codon optimization. Our 20% rate is an upper bound on per-base mutation difficulty; realistic adversaries may achieve our failure region at lower per-base rates.
- Hazardous DB size and composition: 19–40 curated records (~250 kb to 1 Mb). Production screeners use thousands of records totaling 100+ Mb. Relative shape of findings expected to hold; absolute numbers may shift.
- No commec comparison yet: BLAST baseline only. Commec adds taxonomic clearing and HMM-based biorisk detection that could substantially change the numbers. The pipeline is structured so
screen.pycan be replaced with a commec-equivalent. - Cap-vs-data confound in robustness check: v2 changes both the hazardous cap (10 kb → 50 kb) and the data (19 → 40 records). We cannot fully separate the e-value tightening from the inclusion of less coding-dense records. Disentangling these is left to future work.
- No cross-order experiment: Orders evaluated independently. Adversaries can in principle split fragments across N synthetic orders to defeat any per-order screener.
- E-value criterion: We use e-values for screening decisions. E-values are database-size-dependent; alternative criteria (bitscore, percent identity) are not. Our results characterize BLAST under e-value thresholds specifically.
The most useful artifact for writeup work is results/v1_5/summary.csv (primary) or results/v2_robustness/summary.csv (robustness). Each row is one (T, mutation%, evasion_type) cell with AUC and TPR at fixed FPR points. To answer "what does BLAST do at the upcoming 50 bp threshold against a realistic adversary" — filter to T=50, mut_pct ∈ {10, 20}.
For interpretation guidance and pre-empted methodology questions, read RESULTS_MEMO.md first.