Pipette Benchmark Outputs

Reproducibility repository for "Pipette: Encoding scientific literature into an executable skill graph for multi-agent bioinformatics".

This repository contains the complete, unmodified agent outputs for all benchmark tasks described in the manuscript, including scripts, results, figures, reviewer reports, and provenance logs generated autonomously by Pipette.

Repository Structure

benchmark/
  DrugDesign/
    Imatinib/          # Small molecule docking: imatinib into ABL1 (PDB 2HYY)
    peptides/          # De novo cyclic peptide design targeting p53-MDM2 (PDB 1YCR)
  clinical_variants/   # ACMG SF v3.2 analysis of GIAB HG002 (chr1-22)
  clinical_variants_spikein/
    spikein/           # Spike-in variant design (7 pathogenic variants + injection script)
    agent_results/     # Agent analysis of the spiked VCF
  scRNA-seq/
    PBMC68K/           # 68K PBMC single-cell analysis (10x Genomics, Zheng et al. 2017)
    Pancreas/          # Human pancreas single-cell analysis (CEL-seq2, Muraro et al. 2016)
  rice_RNA-seq/        # Bulk RNA-seq differential expression (rice leaf drought and heat stress)

Benchmark Descriptions

1. Drug Design: Imatinib-ABL1 Molecular Docking

Task: Redock imatinib into ABL1 kinase ATP-binding site
Input: PDB 2HYY
Key result: Binding affinity -11.8 kcal/mol, heavy-atom RMSD 0.89 Å vs. crystal pose
Self-correction: Agent fixed missing pH 7.4 protonation and implemented Hungarian RMSD validation after reviewer flagged both issues
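
As a rough illustration of why an assignment step matters for RMSD validation, the sketch below computes a heavy-atom RMSD that is insensitive to the ordering of equivalent atoms. It is not the agent's script: the function name and the toy coordinates are hypothetical, both poses are assumed to share one coordinate frame (as in redocking), and the pairing is found by exhaustive search within each element rather than by the Hungarian algorithm, so it is only practical for a handful of atoms.

```python
import itertools
import math
from collections import defaultdict


def symmetry_aware_rmsd(ref, pose):
    """RMSD between two poses given as lists of (element, (x, y, z)).

    Atoms are paired within each element so that the summed squared
    distance is minimal. Brute-force stand-in for a Hungarian solver.
    """
    if sorted(e for e, _ in ref) != sorted(e for e, _ in pose):
        raise ValueError("poses must contain the same heavy atoms")

    ref_by_element = defaultdict(list)
    pose_by_element = defaultdict(list)
    for element, xyz in ref:
        ref_by_element[element].append(xyz)
    for element, xyz in pose:
        pose_by_element[element].append(xyz)

    total_sq = 0.0
    for element, ref_xyz in ref_by_element.items():
        # Try every pairing of same-element atoms and keep the cheapest one.
        total_sq += min(
            sum(math.dist(a, b) ** 2 for a, b in zip(ref_xyz, perm))
            for perm in itertools.permutations(pose_by_element[element])
        )
    return math.sqrt(total_sq / len(ref))


# Toy example (made-up coordinates): the two oxygens are listed in swapped order.
crystal = [("C", (0.0, 0.0, 0.0)), ("O", (1.0, 0.0, 0.0)), ("O", (-1.0, 0.0, 0.0))]
docked = [("C", (0.0, 0.0, 0.0)), ("O", (-1.0, 0.0, 0.0)), ("O", (1.0, 0.0, 0.0))]
rmsd = symmetry_aware_rmsd(crystal, docked)
```

A naive index-by-index RMSD would penalise the swapped oxygens even though the poses are identical. For a molecule the size of imatinib, a polynomial-time assignment solver would replace the permutation search.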

2. Drug Design: Cyclic Peptide Design (p53-MDM2)

Task: Design 10 cyclic peptides mimicking the p53 Phe19-Trp23-Leu26 hotspot triad
Input: PDB 1YCR
Key result: Top candidate CP07 at -12.1 kcal/mol; hotspot triad C-alpha RMSD 0.95 Å vs. native p53
Self-correction: Agent rescued 2 failed dockings with SMINA and replaced centroid displacement with Kabsch SVD RMSD

3. Clinical Variant Analysis: GIAB HG002

Task: Identify ACMG SF v3.2 secondary findings in HG002 benchmark genome
Input: HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
Key result: 0 P/LP (pathogenic/likely pathogenic), 9 VUS, 92 Benign, as expected for a healthy reference individual
Self-correction: Agent demoted a false-positive PRKAG2 LP call (non-canonical transcript artefact)

4. Clinical Variant Analysis: Spike-in Benchmark

Task: Detect 7 known pathogenic variants injected into HG002 VCF
Input: Spiked VCF (see spikein/spike_in_truth.csv for ground truth)
Key result: 7/7 sensitivity, 0 false positives, MUTYH compound heterozygote correctly identified
Spike-in design: 3 variant types (missense, nonsense, frameshift), 2 inheritance modes (AD, AR), 5 disease categories
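
The sensitivity and false-positive figures above can be recomputed by comparing the agent's reported pathogenic calls with the ground truth. The sketch below is a minimal version of that comparison. It assumes each variant can be keyed as a (chrom, pos, ref, alt) tuple; the function name and the example coordinates are hypothetical and are not taken from spike_in_truth.csv.

```python
def score_spikein(truth, calls):
    """Compare called variants with the spike-in truth set.

    truth, calls: iterables of (chrom, pos, ref, alt) tuples.
    """
    truth_set, call_set = set(truth), set(calls)
    if not truth_set:
        raise ValueError("truth set is empty")
    detected = truth_set & call_set
    return {
        "detected": len(detected),
        "expected": len(truth_set),
        "sensitivity": len(detected) / len(truth_set),
        "missed": sorted(truth_set - call_set),
        "false_positives": sorted(call_set - truth_set),
    }


# Placeholder variants for illustration only (not real spike-in coordinates).
example_truth = [("chr1", 1000, "A", "G"), ("chr2", 2000, "C", "T"), ("chr3", 3000, "G", "GA")]
example_calls = [("chr1", 1000, "A", "G"), ("chr2", 2000, "C", "T"), ("chr3", 3000, "G", "GA")]
summary = score_spikein(example_truth, example_calls)
```

Here a false positive means a call reported as pathogenic or likely pathogenic that is absent from the truth set, so only P/LP calls should be passed in as `calls`.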

5. Single-Cell: PBMC 68K

Task: Complete scRNA-seq analysis of 68K PBMCs
Input: 10x Genomics Fresh 68K PBMC dataset (Zheng et al., 2017)
Key result: Proportional concordance r = 0.959 vs. published cell type proportions; mean marker gene recall 69.3%
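
This README does not define the two metrics above, so the following sketch shows one plausible reading: proportional concordance as the Pearson correlation between agent-derived and published cell-type proportions, and marker gene recall as the fraction of reference markers recovered per cell type, averaged over cell types. The function names and the example numbers are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def proportional_concordance(agent, published):
    """Correlate proportions over the cell types present in both mappings."""
    shared = sorted(set(agent) & set(published))
    return pearson_r([agent[c] for c in shared], [published[c] for c in shared])


def mean_marker_recall(detected, reference):
    """Average, over cell types, of the share of reference markers recovered."""
    recalls = [
        len(set(markers) & set(detected.get(cell_type, ()))) / len(set(markers))
        for cell_type, markers in reference.items()
    ]
    return sum(recalls) / len(recalls)


# Made-up proportions and marker lists, for illustration only.
agent_props = {"T cell": 0.60, "B cell": 0.10, "Monocyte": 0.20, "NK": 0.10}
published_props = {"T cell": 0.62, "B cell": 0.09, "Monocyte": 0.18, "NK": 0.11}
r = proportional_concordance(agent_props, published_props)
recall = mean_marker_recall(
    {"T cell": ["CD3D", "CD3E"], "B cell": ["CD79A"]},
    {"T cell": ["CD3D", "CD3E", "CD2", "IL7R"], "B cell": ["CD79A", "MS4A1"]},
)
```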

6. Single-Cell: Human Pancreas

Task: Complete scRNA-seq analysis of human pancreas (4 donors, CEL-seq2)
Input: GSE85241 (Muraro et al., 2016)
Key result: Harmonised ARI = 0.97, proportional concordance r = 0.991, mean marker gene recall 88.6%
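
For readers unfamiliar with the adjusted Rand index (ARI), the sketch below computes it from two label vectors using the standard pair-counting formula. The harmonisation step is an assumption about what "Harmonised ARI" means here: agent cluster names are mapped onto the reference vocabulary (for example, merging sub-clusters) before scoring. The mapping and labels shown are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label vectors must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("need at least two cells")

    # Count pairs of cells that share a label in both, in truth, and in prediction.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = pairs_true * pairs_pred / comb(n, 2)
    maximum = (pairs_true + pairs_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. one cluster in both labelings
    return (pairs_both - expected) / (maximum - expected)


def harmonise(labels, mapping):
    """Rename labels via a mapping; labels not in the mapping are kept as-is."""
    return [mapping.get(label, label) for label in labels]


# Hypothetical labels: the agent split acinar cells into two sub-clusters.
reference = ["alpha", "alpha", "beta", "beta", "acinar", "acinar"]
agent = ["alpha", "alpha", "beta", "beta", "acinar_1", "acinar_2"]
ari_raw = adjusted_rand_index(reference, agent)
ari_harmonised = adjusted_rand_index(reference, harmonise(agent, {"acinar_1": "acinar", "acinar_2": "acinar"}))
```

ARI itself is invariant to how clusters are named, so harmonisation only changes the score when it merges or splits groups, as in the example.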

7. Bulk RNA-seq: Rice Leaf Drought and Heat Stress

Task: DESeq2 differential expression analysis of rice leaf segments under four conditions: well-watered (W30), drought (D30), heat shock (W40), and combined drought + heat shock (D40)
Input: Raw count matrix (42,189 genes x 80 samples, GEO: GSE295637)

Output Structure (per benchmark)

Each benchmark directory contains the agent's unmodified outputs:

Directory	Contents
`results/`	Analysis results (CSVs, JSONs, classified variants, cell metadata)
`figures/`	Agent-generated visualisations (PNG)
`scripts/`	All Python/Bash scripts the agent wrote and executed
`reports/`	Final analysis report (Markdown)
`logs/`	Execution logs and stderr
`thinking.md`	Agent reasoning log (iteration-by-iteration decision making)
`provenance.json`	Full execution provenance (inputs, outputs, SHA1 hashes, wall times)
`results/reviewer_report.json`	Reviewer Agent verdict (pass/warn/fail per check item)
`results/reviewer_report_rereviewed.json`	Second-pass reviewer verdict (after self-correction)
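
Because `provenance.json` records SHA1 hashes, downloaded outputs can be checked against it. The sketch below shows the general idea only: the record layout (a list of entries with `path` and `sha1` keys) and the function names are assumptions, so adapt the key names to the actual file before use.

```python
import hashlib
from pathlib import Path


def sha1_of(path, chunk_size=1 << 20):
    """Return the hex SHA1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_outputs(benchmark_dir, records):
    """Return the paths whose file is missing or whose hash differs.

    records: iterable of {"path": ..., "sha1": ...} dicts (hypothetical layout).
    """
    mismatches = []
    for record in records:
        target = Path(benchmark_dir) / record["path"]
        if not target.is_file() or sha1_of(target) != record["sha1"]:
            mismatches.append(record["path"])
    return mismatches
```

An empty return value means every listed file is present and matches its recorded hash.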

Reviewer Reports

Each benchmark includes the Reviewer Agent's quality-check reports in JSON format. These contain structured verdicts (pass/warn/fail) with evidence and recommendations for each check item. Pre-review reports (where available) are in pre_review/reviewer_report.json; post-revision reports are in results/reviewer_report.json.
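
To summarise a report, or to see which issues a revision resolved, the verdicts can be tallied as sketched below. The JSON schema is an assumption made for illustration (a top-level `checks` list whose entries carry `item` and `verdict` keys); the real reports may use different field names.

```python
from collections import Counter


def tally_verdicts(report):
    """Count pass/warn/fail verdicts in one reviewer report (assumed schema)."""
    return Counter(check["verdict"] for check in report["checks"])


def resolved_items(first_pass, second_pass):
    """List check items that were not 'pass' at first but pass after revision."""
    before = {c["item"]: c["verdict"] for c in first_pass["checks"]}
    after = {c["item"]: c["verdict"] for c in second_pass["checks"]}
    return sorted(
        item for item, verdict in before.items()
        if verdict != "pass" and after.get(item) == "pass"
    )


# Hypothetical reports; in practice these would come from json.load().
first = {"checks": [
    {"item": "protonation", "verdict": "fail"},
    {"item": "rmsd_validation", "verdict": "warn"},
    {"item": "box_definition", "verdict": "pass"},
]}
second = {"checks": [
    {"item": "protonation", "verdict": "pass"},
    {"item": "rmsd_validation", "verdict": "pass"},
    {"item": "box_definition", "verdict": "pass"},
]}
fixed = resolved_items(first, second)
```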

Reproducing the Spike-in Benchmark

cd benchmark/clinical_variants_spikein/spikein/
bash inject_spikein.sh

This merges spike_in_variants.vcf (7 known pathogenic variants) into the GIAB HG002 VCF using bcftools, producing HG002_GRCh38_1_22_v4.2.1_benchmark_spiked.vcf.gz. The ground truth is in spike_in_truth.csv.

Note: The original HG002 VCF (HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz) is not included due to size (~1.5 GB). It can be downloaded from the GIAB FTP.

Data Sources

Dataset	Source	Accession
HG002 benchmark VCF	NIST Genome in a Bottle	NISTv4.2.1/GRCh38
ACMG SF v3.2 gene panel	Miller et al., 2023	Genetics in Medicine 25:100866
ABL1 kinase structure	RCSB PDB	2HYY
MDM2-p53 structure	RCSB PDB	1YCR
PBMC 68K	10x Genomics	Fresh 68K PBMCs
Human pancreas	NCBI GEO	GSE85241
Rice RNA-seq DESeq2 RDS object & spiked-in VCF	Zenodo	10.5281/zenodo.19433635

Citation

If you use these benchmark outputs, please cite:

[Manuscript citation to be added upon publication]

License

Agent outputs are provided for research reproducibility. Input datasets are subject to their respective data providers' terms of use.
