Skip to content

utkarshsingh34/AIxBio

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

BLAST sensitivity benchmark for DNA synthesis screening

A pipeline that quantifies how BLAST's detection of hazardous DNA sequences degrades as adversaries fragment and mutate the input. Built for the AIxBio hackathon to inform the policy debate around minimum DNA synthesis screening thresholds.

The question this answers

DNA synthesis providers screen orders against curated lists of hazardous sequences (toxin genes, pathogen markers, etc.). The 2024 OSTP Framework currently requires screening at the 200 bp level, with the threshold dropping to 50 bp effective October 2026. We measure how the upcoming 50 bp threshold holds up when an adversary:

  1. Fragments the hazardous gene into small pieces (varying T from 20 to 200 bp)
  2. Mutates the fragments to evade exact-match alignment (varying rate from 0 to 20%)
  3. Hides hazardous fragments inside benign filler (dilute evasion vs pure evasion)

The output is a per-fragment sensitivity profile and a per-order ROC under "any-flag" aggregation, summarized into four figures, plus a robustness check under a more rigorous data configuration.

Headline findings

Per-fragment sensitivity: at the upcoming 50 bp threshold and standard biosecurity cutoff (e ≤ 1e-9), BLAST detects 100% of unmutated hazardous fragments but only 16% of fragments with 20% adversarial mutation.

Dilute attack failure mode: at the same 50 bp threshold and 20% mutation, BLAST catches only 60% of dilute attacks at 1% FPR (one hazardous fragment hidden among benign filler). Pure attacks (orders entirely composed of hazardous fragments) remain caught reliably across nearly all configurations.

Robustness: under an expanded data configuration (10 diversified honest sources spanning bacterial, viral, eukaryotic, plant, and synthetic origins; 40-record hazardous database under a 50 kb cap), the qualitative findings are preserved and the quantitative gap widens (60% → 48% catch rate at the same operating point).

See RESULTS_MEMO.md for full interpretation and the writeup PDF for the published version.

Repo layout

pipeline/                       # All scripts run from here
  fetch_data.py                 # NCBI download of E. coli K-12 (primary honest source)
  fetch_diverse_honest.py       # NCBI download of 10 diverse sources (robustness honest source)
  aggregate_hazardous.py        # Merge .gb + .fasta files into one consolidated FASTA
  build_corpus.py               # Generate honest, evasion_pure, evasion_dilute corpora
                                #   --mutation-rate, --hazardous, --out-dir flags
  screen.py                     # BLAST a corpus, write per-fragment + per-order CSVs
  evaluate.py                   # Sweep thresholds, compute ROC, plot all 4 figures
  benchmark.py                  # Optional: time the pipeline at different corpus sizes
  hazardous_PLACEHOLDER.fasta   # Tiny synthetic placeholder for dry-run testing
  RUNBOOK.md                    # Step-by-step run instructions (Windows / cmd)

data/                           # All gitignored
  hazardous_data/               # YOU PROVIDE: folder of .gb / .fasta files
  hazardous.fasta               # GENERATED by aggregate_hazardous.py + cap filter
  hazardous_manifest.tsv        # GENERATED — what records came from where
  honest_source.fasta           # GENERATED by fetch_data.py or fetch_diverse_honest.py
  honest_sources/               # GENERATED — per-source cache for fetch_diverse_honest.py
  honest_manifest.tsv           # GENERATED — per-source audit trail
  corpora/                      # GENERATED by build_corpus.py
                                #   one subdir per (T, mutation rate)

results/
  v1_5/                         # Primary results (E. coli only, 19 records, 10 kb cap)
    fig1_fragment_sensitivity.png
    fig2_order_roc_panels.png
    fig3_evasion_heatmap.png
    fig4_operating_points.png
    fragment_sensitivity.csv
    roc_data.csv
    summary.csv
  v2_robustness/                # Robustness check (10 honest sources, 40 records, 50 kb cap)
    (same file structure as v1_5/)
  README.md                     # Explains the v1_5 vs v2_robustness split

README.md                       # This file
RESULTS_MEMO.md                 # Interpretive document — read before the figures
.gitignore

Reproducing the results

Prerequisites: Windows + Miniforge + a blast-env conda env with blast, biopython, pandas, matplotlib. See pipeline/RUNBOOK.md for one-time setup.

Primary results (v1.5 configuration)

cd pipeline

REM 1. Get honest source genome (E. coli K-12 MG1655, ~5 MB)
python fetch_data.py

REM 2. Aggregate your hazardous .gb / .fasta folder into one FASTA
REM    Requires data/hazardous_data/ folder pre-populated with your input files
python aggregate_hazardous.py

REM 3. Filter to records ≤ 10 kb (toxin-CDS scale; primary configuration)
python -c "from Bio import SeqIO; recs=[r for r in SeqIO.parse('../data/hazardous.fasta','fasta') if len(r.seq)<=10000]; SeqIO.write(recs, '../data/hazardous.fasta', 'fasta'); print(f'Kept {len(recs)} records')"

REM 4. Build the BLAST DB
makeblastdb -in ..\data\hazardous.fasta -dbtype nucl -out hazardous_db -title "Hazardous reference"

REM 5. Generate corpora at all (T, mutation) combinations (~5 min)
for %M in (0.0 0.01 0.05 0.10 0.15 0.20) do (
    for %T in (20 30 50 75 100 150 200) do (
        python build_corpus.py --T %T --n-orders 1000 --mutation-rate %M
    )
)

REM 6. Screen all corpora (~30-45 min)
for %M in (0 1 5 10 15 20) do (
    for %T in (20 30 50 75 100 150 200) do (
        python screen.py --corpus ..\data\corpora\T%T_mut%M\honest.fasta --db hazardous_db
        python screen.py --corpus ..\data\corpora\T%T_mut%M\evasion_pure.fasta --db hazardous_db
        python screen.py --corpus ..\data\corpora\T%T_mut%M\evasion_dilute.fasta --db hazardous_db
    )
)

REM 7. Produce all four figures and summary tables
python evaluate.py

Robustness check (v2 configuration)

To reproduce the robustness check (Appendix C of the writeup), the only differences are:

  • Honest source: 10 diversified sources (~100 Mb) instead of E. coli alone. Replace step 1 with python fetch_diverse_honest.py (downloads cache to data/honest_sources/ and consolidates to data/honest_source.fasta).
  • Hazardous cap: 50 kb instead of 10 kb. In step 3, change len(r.seq)<=10000 to len(r.seq)<=50000.

All other steps are unchanged. After running evaluate.py, the figures will reflect the v2 configuration.

To preserve both result sets side by side, move outputs into results/v1_5/ or results/v2_robustness/ between runs (or update evaluate.py's output directory).

For dry-run / development with the placeholder hazardous data, copy pipeline/hazardous_PLACEHOLDER.fasta to data/hazardous.fasta and run from step 4.

Methodology in one paragraph

A consolidated hazardous reference (primary: 19 records ≤ 10 kb; robustness: 40 records ≤ 50 kb) is built into a nucleotide BLAST DB. For each fragment-length T ∈ {20, 30, 50, 75, 100, 150, 200} bp and mutation rate M ∈ {0, 1, 5, 10, 15, 20}%, we generate three order corpora of 1000 orders each: honest (5–20 random T-bp slices from E. coli K-12 MG1655 in primary, or from a 10-source diversified corpus in robustness), evasion_pure (5–20 random T-bp slices from one hazardous parent, mutated at rate M), and evasion_dilute (one mutated hazardous fragment + 4–19 unmutated honest fillers). Mutation introduces independent per-base random substitutions to a different nucleotide, modeling an adversary who synthesizes a variant of the hazardous gene rather than the literal sequence. Each corpus is screened with blastn -task blastn-short -evalue 10 against the hazardous DB. Per-fragment results feed Figures 1 and 3; per-order results aggregate via "any-flag" (an order is flagged if any fragment hits below the screening threshold) and feed Figures 2, 3, and 4.

What's in the figures

  • Figure 1 (fig1_fragment_sensitivity.png) — per-fragment sensitivity. Three rows of panels stratified by e-value cutoff (1e-3, 1e-9, 1e-30); left column shows hazardous TPR vs T, right column shows honest FPR. One curve per mutation rate (0–20%, viridis colormap). The scientific backup figure showing BLAST's underlying sensitivity profile.
  • Figure 2 (fig2_order_roc_panels.png) — per-order ROC under any-flag aggregation. 7×6 grid: rows = T, cols = mutation rate. Pure (blue) and dilute (orange) curves overlaid. Dense; supplementary use only.
  • Figure 3 (fig3_evasion_heatmap.png)the headline figure. Two heatmaps side by side: pure evasion (left) and dilute evasion (right). T on Y, mutation rate on X, color = TPR at FPR=1%. Makes the "evasion gap" visible in one glance.
  • Figure 4 (fig4_operating_points.png)slide-friendly summary. Bar chart of catch rates at T=50 (the upcoming OSTP threshold) across three adversary types (naive 0%, realistic 10%, sophisticated 20%) and three FPR operating points.

Known limitations

  • Honest source diversity: The robustness check (10 sources, ~100 Mb spanning bacterial, viral, eukaryotic, plant, and synthetic) confirms honest FPR remains ≤ 0.3% across all configurations. Whether this reflects database-bounded behavior or insufficient diversity at our scale is open — real customer orders span synthetic-bio constructs and poorly-annotated organism content we did not measure.
  • Mutation model: uniform-random substitution. Real adversaries can do better with synonymous substitutions or codon optimization. Our 20% rate is an upper bound on per-base mutation difficulty; realistic adversaries may achieve our failure region at lower per-base rates.
  • Hazardous DB size and composition: 19–40 curated records (~250 kb to 1 Mb). Production screeners use thousands of records totaling 100+ Mb. Relative shape of findings expected to hold; absolute numbers may shift.
  • No commec comparison yet: BLAST baseline only. Commec adds taxonomic clearing and HMM-based biorisk detection that could substantially change the numbers. The pipeline is structured so screen.py can be replaced with a commec-equivalent.
  • Cap-vs-data confound in robustness check: v2 changes both the hazardous cap (10 kb → 50 kb) and the data (19 → 40 records). We cannot fully separate the e-value tightening from the inclusion of less coding-dense records. Disentangling these is left to future work.
  • No cross-order experiment: Orders evaluated independently. Adversaries can in principle split fragments across N synthetic orders to defeat any per-order screener.
  • E-value criterion: We use e-values for screening decisions. E-values are database-size-dependent; alternative criteria (bitscore, percent identity) are not. Our results characterize BLAST under e-value thresholds specifically.

Output cheat sheet

The most useful artifact for writeup work is results/v1_5/summary.csv (primary) or results/v2_robustness/summary.csv (robustness). Each row is one (T, mutation%, evasion_type) cell with AUC and TPR at fixed FPR points. To answer "what does BLAST do at the upcoming 50 bp threshold against a realistic adversary" — filter to T=50, mut_pct ∈ {10, 20}.

For interpretation guidance and pre-empted methodology questions, read RESULTS_MEMO.md first.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages