variomeanalytics/pipette_benchmark

Pipette Benchmark Outputs

Reproducibility repository for "Pipette: Encoding scientific literature into an executable skill graph for multi-agent bioinformatics".

This repository contains the complete, unmodified agent outputs for all benchmark tasks described in the manuscript, including scripts, results, figures, reviewer reports, and provenance logs generated autonomously by Pipette.


Repository Structure

benchmark/
  DrugDesign/
    Imatinib/          # Small molecule docking: imatinib into ABL1 (PDB 2HYY)
    peptides/          # De novo cyclic peptide design targeting p53-MDM2 (PDB 1YCR)
  clinical_variants/   # ACMG SF v3.2 analysis of GIAB HG002 (chr1-22)
  clinical_variants_spikein/
    spikein/           # Spike-in variant design (7 pathogenic variants + injection script)
    agent_results/     # Agent analysis of the spiked VCF
  scRNA-seq/
    PBMC68K/           # 68K PBMC single-cell analysis (10x Genomics, Zheng et al. 2017)
    Pancreas/          # Human pancreas single-cell analysis (CEL-seq2, Muraro et al. 2016)
  rice_RNA-seq/        # Bulk RNA-seq differential expression (rice leaf, drought and heat stress)

Benchmark Descriptions

1. Drug Design: Imatinib-ABL1 Molecular Docking

  • Task: Redock imatinib into ABL1 kinase ATP-binding site
  • Input: PDB 2HYY
  • Key result: Binding affinity -11.8 kcal/mol, heavy-atom RMSD 0.89 Å vs. crystal pose
  • Self-correction: After the reviewer flagged both issues, the agent added the missing protonation at pH 7.4 and implemented RMSD validation with Hungarian atom matching
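
The idea behind an atom-matched RMSD check can be illustrated with a minimal sketch. This is not the agent's script: the function names and toy coordinates are hypothetical, both poses are assumed to share the receptor frame (so no superposition is applied), and a brute-force search over same-element pairings stands in for the Hungarian algorithm, which in practice could be run with scipy.optimize.linear_sum_assignment.

```python
import itertools
import math
from collections import Counter, defaultdict


def sq_dist(p, q):
    """Squared Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def naive_rmsd(ref_atoms, pose_atoms):
    """RMSD pairing atoms by list position (ignores atom reordering)."""
    total = sum(sq_dist(r[1], p[1]) for r, p in zip(ref_atoms, pose_atoms))
    return math.sqrt(total / len(ref_atoms))


def matched_rmsd(ref_atoms, pose_atoms):
    """RMSD after optimally pairing atoms of the same element.

    Each atom is (element, (x, y, z)). The pairing minimises the summed
    squared distance, which is the objective the Hungarian algorithm solves;
    brute force is used here and is only feasible for small element groups.
    """
    if Counter(el for el, _ in ref_atoms) != Counter(el for el, _ in pose_atoms):
        raise ValueError("element counts differ between the two poses")
    ref_by_el, pose_by_el = defaultdict(list), defaultdict(list)
    for el, xyz in ref_atoms:
        ref_by_el[el].append(xyz)
    for el, xyz in pose_atoms:
        pose_by_el[el].append(xyz)
    total = 0.0
    for el, ref_xyz in ref_by_el.items():
        total += min(
            sum(sq_dist(r, p) for r, p in zip(ref_xyz, perm))
            for perm in itertools.permutations(pose_by_el[el])
        )
    return math.sqrt(total / len(ref_atoms))


# Toy example (made-up coordinates): the two carbons are listed in swapped order.
crystal = [("C", (0.0, 0.0, 0.0)), ("C", (1.0, 0.0, 0.0)), ("N", (0.0, 1.0, 0.0))]
docked = [("C", (1.0, 0.0, 0.0)), ("C", (0.0, 0.0, 0.0)), ("N", (0.0, 1.0, 0.0))]
naive_value = naive_rmsd(crystal, docked)
matched_value = matched_rmsd(crystal, docked)
```

In the toy example the index-based RMSD is inflated purely by atom ordering, while the matched RMSD is zero, which is the kind of discrepancy an atom-matching step is meant to remove.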

2. Drug Design: Cyclic Peptide Design (p53-MDM2)

  • Task: Design 10 cyclic peptides mimicking the p53 Phe19-Trp23-Leu26 hotspot triad
  • Input: PDB 1YCR
  • Key result: Top candidate CP07 at -12.1 kcal/mol; hotspot triad C-alpha RMSD 0.95 Å vs. native p53
  • Self-correction: The agent rescued 2 failed dockings using SMINA and replaced the centroid-displacement metric with Kabsch (SVD) RMSD

3. Clinical Variant Analysis: GIAB HG002

  • Task: Identify ACMG SF v3.2 secondary findings in HG002 benchmark genome
  • Input: HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
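  • Key result: 0 pathogenic/likely pathogenic (P/LP), 9 variants of uncertain significance (VUS), 92 benign; this is the expected outcome for a healthy reference individual
  • Self-correction: The agent demoted a false-positive PRKAG2 likely pathogenic (LP) call that was an artefact of a non-canonical transcript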

4. Clinical Variant Analysis: Spike-in Benchmark

  • Task: Detect 7 known pathogenic variants injected into HG002 VCF
  • Input: Spiked VCF (see spikein/spike_in_truth.csv for ground truth)
  • Key result: 7/7 sensitivity, 0 false positives, MUTYH compound heterozygote correctly identified
  • Spike-in design: 3 variant types (missense, nonsense, frameshift), 2 inheritance modes (AD, AR), 5 disease categories
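
Sensitivity and false-positive counts of this kind can be computed by comparing called variants against the ground truth as sets. The sketch below assumes hypothetical column names (chrom, pos, ref, alt); the real layout of spike_in_truth.csv may differ, and the coordinates are placeholders rather than the actual spike-in variants.

```python
import csv
import io


def variant_key(row):
    """Identify a variant by position and alleles (assumed column names)."""
    return (row["chrom"], int(row["pos"]), row["ref"], row["alt"])


def score_calls(truth_rows, called_rows):
    """Compare called variants with the truth set."""
    truth = {variant_key(r) for r in truth_rows}
    called = {variant_key(r) for r in called_rows}
    hits = truth & called
    return {
        "true_positives": len(hits),
        "false_negatives": len(truth - called),
        "false_positives": len(called - truth),
        "sensitivity": len(hits) / len(truth) if truth else float("nan"),
    }


# Placeholder data, not the benchmark's variants.
truth_csv = "chrom,pos,ref,alt\nchr1,100,A,G\nchr2,200,C,T\nchr3,300,G,A\n"
calls_csv = "chrom,pos,ref,alt\nchr1,100,A,G\nchr2,200,C,T\nchr4,400,T,C\n"
scores = score_calls(
    csv.DictReader(io.StringIO(truth_csv)),
    csv.DictReader(io.StringIO(calls_csv)),
)
```

With the placeholder data, two of three truth variants are recovered and one extra call is reported, so the result distinguishes a missed variant from a false positive.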

5. Single-Cell: PBMC 68K

  • Task: Complete scRNA-seq analysis of 68K PBMCs
  • Input: 10x Genomics Fresh 68K PBMC dataset (Zheng et al., 2017)
  • Key result: Proportional concordance r = 0.959 vs. published cell type proportions; mean marker gene recall 69.3%
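
As a rough illustration of the two metrics quoted above, the sketch below computes a Pearson correlation between published and observed cell type proportions and a mean per-cell-type marker recall. Both definitions are assumptions about how the metrics might be calculated, not the manuscript's exact procedure, and all names and numbers are hypothetical toy values.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def proportion_concordance(published, observed):
    """Correlate proportions over the published cell types.

    Cell types missing from the observed annotation count as 0.
    """
    cell_types = sorted(published)
    return pearson_r(
        [published[ct] for ct in cell_types],
        [observed.get(ct, 0.0) for ct in cell_types],
    )


def mean_marker_recall(expected_markers, detected_markers):
    """Average, over cell types, of the fraction of expected markers detected."""
    recalls = [
        len(set(expected) & set(detected_markers.get(ct, ()))) / len(expected)
        for ct, expected in expected_markers.items()
    ]
    return sum(recalls) / len(recalls)


# Toy inputs (not taken from the benchmark).
published = {"T cell": 0.60, "B cell": 0.10, "Monocyte": 0.30}
observed = {"T cell": 0.55, "B cell": 0.12, "Monocyte": 0.33}
r_value = proportion_concordance(published, observed)

expected = {"T cell": ["CD3D", "CD3E"], "B cell": ["CD79A", "MS4A1"]}
detected = {"T cell": ["CD3D", "CD3E", "IL7R"], "B cell": ["CD79A"]}
recall_value = mean_marker_recall(expected, detected)
```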

6. Single-Cell: Human Pancreas

  • Task: Complete scRNA-seq analysis of human pancreas (4 donors, CEL-seq2)
  • Input: GSE85241 (Muraro et al., 2016)
  • Key result: Harmonised ARI = 0.97, proportional concordance r = 0.991, mean marker gene recall 88.6%
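
The adjusted Rand index (ARI) compares two labelings of the same cells while correcting for chance agreement. The sketch below implements the standard formula from the contingency table; "harmonised" is assumed here to mean mapping label synonyms onto a shared vocabulary before scoring, and the mapping table and labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    # Pairs of items that are grouped together in both labelings.
    index = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (index - expected) / (max_index - expected)


def harmonise(labels, mapping):
    """Map label synonyms onto a shared vocabulary (hypothetical mapping)."""
    return [mapping.get(label, label) for label in labels]


# Toy labels (not from the benchmark).
synonyms = {"Alpha cell": "alpha", "Beta cell": "beta"}
reference = ["alpha", "alpha", "beta", "beta", "delta", "delta"]
predicted = ["Alpha cell", "alpha", "beta", "Beta cell", "delta", "delta"]
ari_raw = adjusted_rand_index(reference, predicted)
ari_harmonised = adjusted_rand_index(reference, harmonise(predicted, synonyms))
```

Without harmonisation the synonym labels are treated as separate clusters and lower the score; after mapping, the two toy labelings agree exactly.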

7. Bulk RNA-seq: Rice Leaf Drought and Heat Stress

  • Task: DESeq2 differential expression analysis of rice leaf segments under four conditions: well-watered (W30), drought (D30), heat shock (W40), and combined drought + heat shock (D40)
  • Input: Raw count matrix (42,189 genes × 80 samples, GEO: GSE295637)

Output Structure (per benchmark)

Each benchmark directory contains the agent's unmodified outputs:

| Directory | Contents |
|---|---|
| results/ | Analysis results (CSVs, JSONs, classified variants, cell metadata) |
| figures/ | Agent-generated visualisations (PNG) |
| scripts/ | All Python/Bash scripts the agent wrote and executed |
| reports/ | Final analysis report (Markdown) |
| logs/ | Execution logs and stderr |
| thinking.md | Agent reasoning log (iteration-by-iteration decision making) |
| provenance.json | Full execution provenance (inputs, outputs, SHA1 hashes, wall times) |
| results/reviewer_report.json | Reviewer Agent verdict (pass/warn/fail per check item) |
| results/reviewer_report_rereviewed.json | Second-pass reviewer verdict (after self-correction) |
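
The SHA1 hashes recorded in provenance.json make it possible to check that output files are unchanged. The sketch below shows one way to do this; the JSON layout it expects (an "outputs" list of objects with "path" and "sha1" keys, with paths relative to the benchmark directory) is an assumption for illustration and may not match the actual schema.

```python
import hashlib
import json
from pathlib import Path


def sha1_of(path, chunk_size=1 << 20):
    """SHA1 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_outputs(provenance_path):
    """Return the recorded output paths that are missing or have a different hash.

    Assumed (hypothetical) schema: {"outputs": [{"path": "...", "sha1": "..."}]}
    """
    provenance_path = Path(provenance_path)
    record = json.loads(provenance_path.read_text())
    mismatches = []
    for entry in record.get("outputs", []):
        target = provenance_path.parent / entry["path"]
        if not target.is_file() or sha1_of(target) != entry["sha1"]:
            mismatches.append(entry["path"])
    return mismatches
```

A call such as verify_outputs("benchmark/clinical_variants/provenance.json") would return an empty list when every recorded file matches, under the schema assumption above.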

Reviewer Reports

Each benchmark includes the Reviewer Agent's quality-check reports in JSON format. These contain structured verdicts (pass/warn/fail) with evidence and recommendations for each check item. Pre-review reports (where available) are in pre_review/reviewer_report.json; post-revision reports are in results/reviewer_report.json.
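
A reviewer report with pass/warn/fail verdicts can be summarised in a few lines. The sketch below assumes a hypothetical layout (a "checks" list whose entries carry "item" and "verdict" keys); the field names in the actual JSON files may differ, and the example report is invented.

```python
import json
from collections import Counter


def summarise_review(report):
    """Count verdicts and list the check items that did not pass.

    Assumed (hypothetical) schema:
    {"checks": [{"item": "...", "verdict": "pass" | "warn" | "fail", ...}]}
    """
    checks = report["checks"]
    counts = Counter(check["verdict"] for check in checks)
    summary = {verdict: counts.get(verdict, 0) for verdict in ("pass", "warn", "fail")}
    flagged = [check["item"] for check in checks if check["verdict"] != "pass"]
    return summary, flagged


# Invented example report, for illustration only.
example_report = json.loads("""
{
  "checks": [
    {"item": "input integrity", "verdict": "pass"},
    {"item": "metric definition", "verdict": "warn"},
    {"item": "protonation state", "verdict": "fail"}
  ]
}
""")
summary, flagged = summarise_review(example_report)
```

Running the same function on a first-pass report and on its re-reviewed counterpart gives a quick view of which flagged items were resolved.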


Reproducing the Spike-in Benchmark

cd benchmark/clinical_variants_spikein/spikein/
bash inject_spikein.sh

This merges spike_in_variants.vcf (7 known pathogenic variants) into the GIAB HG002 VCF using bcftools, producing HG002_GRCh38_1_22_v4.2.1_benchmark_spiked.vcf.gz. The ground truth is in spike_in_truth.csv.

Note: The original HG002 VCF (HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz) is not included due to size (~1.5 GB). It can be downloaded from the GIAB FTP.
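
After running the injection script, it can be useful to confirm that every spike-in record is present in the spiked VCF. The following is a minimal sketch of such a check, not part of the repository: it uses only the Python standard library, matches records on CHROM, POS, REF, and ALT, and the function names are hypothetical.

```python
import gzip


def open_text(path):
    """Open a plain or gzip/bgzip-compressed text file for reading."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def iter_vcf_keys(path):
    """Yield (chrom, pos, ref, alt) for each ALT allele of each VCF record."""
    with open_text(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip header and blank lines
            chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
            for allele in alt.split(","):
                yield (chrom, int(pos), ref, allele)


def missing_spikeins(spikein_vcf, spiked_vcf):
    """Return spike-in variants that are absent from the spiked VCF."""
    wanted = set(iter_vcf_keys(spikein_vcf))
    for key in iter_vcf_keys(spiked_vcf):
        wanted.discard(key)
        if not wanted:
            break  # all spike-ins found, no need to read further
    return sorted(wanted)
```

For example, missing_spikeins("spike_in_variants.vcf", "HG002_GRCh38_1_22_v4.2.1_benchmark_spiked.vcf.gz") should return an empty list if the merge succeeded. Exact matching on REF and ALT assumes the merge did not re-normalise or combine alleles at the injected sites.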


Data Sources

| Dataset | Source | Accession |
|---|---|---|
| HG002 benchmark VCF | NIST Genome in a Bottle | NISTv4.2.1/GRCh38 |
| ACMG SF v3.2 gene panel | Miller et al., 2023 | Genetics in Medicine 25:100866 |
| ABL1 kinase structure | RCSB PDB | 2HYY |
| MDM2-p53 structure | RCSB PDB | 1YCR |
| PBMC 68K | 10x Genomics | Fresh 68K PBMCs |
| Human pancreas | NCBI GEO | GSE85241 |
| Rice RNA-seq DESeq2 RDS object & spiked-in VCF | Zenodo | 10.5281/zenodo.19433635 |

Citation

If you use these benchmark outputs, please cite:

[Manuscript citation to be added upon publication]


License

Agent outputs are provided for research reproducibility. Input datasets are subject to their respective data providers' terms of use.
