Code & Curated Dataset
This repository contains the code and curated data accompanying:
Neves, P.; Hao, B.; Aikonen, S. et al. “Robust Out-of-Distribution Prediction of Buchwald–Hartwig Reactions” (2025), ChemRxiv, DOI: 10.26434/chemrxiv-2025-xcr46
It provides:
- A unified, curated Buchwald–Hartwig (BH) high-throughput experimentation (HTE) dataset (JnJ dedicated production + open-source)
- Notebooks to reproduce results:
- Out-of-distribution (OOD) benchmarking across data sources
- The Compound–Reaction Diversity Score (CRDS) vs OOD performance
- The impact of the newly produced JnJ HTE dataset on model performance when added to existing data
The code was developed and tested with:
- Python 3.8.13
- scikit-learn 1.1.2
To reproduce the results, we recommend using the same Python and scikit-learn versions. The RDKit version used was 2021.03.3, but the DRFP and PhysChem features are already included in the curated dataset, so RDKit is not required to run the notebooks.
```bash
conda create -n bh_ood python=3.8
conda activate bh_ood
pip install "scikit-learn==1.1.2" numpy pandas scipy tqdm matplotlib
```
Installation typically takes under 5 minutes via pip, and the single demo benchmark runs in approximately 10 minutes on a standard desktop computer.
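To confirm the environment matches the tested versions before running the notebooks, a quick check:

```python
# Quick sanity check that the installed versions match the tested ones.
import sklearn
import numpy
import pandas

print("scikit-learn:", sklearn.__version__)  # expected: 1.1.2
print("numpy:", numpy.__version__)
print("pandas:", pandas.__version__)
```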
File: data/BH_HTE_Curated_Dataset_v2025-11.csv
Pickle version: contains additional columns; due to its size, it is hosted at https://zenodo.org/records/17634928
The .csv, loaded as a pandas.DataFrame, contains:
- Reaction identity
  - Aryl and amine SMILES (e.g. `Aryl SMILES`, `Amine SMILES`)
  - Reagent SMILES (e.g. `Catalyst SMILES`, `Solvent SMILES`, `Base SMILES`)
  - Yield / response information and a binary success label as used in the manuscript
- Metadata
  - `Source`: origin of each reaction (e.g. literature dataset name or internal JnJ campaign)
- Features / clustering
  - Columns for features (e.g. DRFP / PhysChem-based, depending on preprocessing)
  - Cluster assignments: `iteration_0 cluster`, `iteration_1 cluster`, …

These are used to build in-domain vs. out-of-distribution (OOD) train/test splits. Because reagents are included in the fingerprints, the cluster-based definition of OOD does not by itself guarantee substrate-level OOD; for the data source benchmark, substrate novelty is therefore enforced in the code.
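As an illustration of the strict-OOD filtering, here is a minimal sketch against the .csv. The choice of which cluster serves as the test pool is hypothetical here; the notebooks implement the actual split logic:

```python
import pandas as pd

df = pd.read_csv("data/BH_HTE_Curated_Dataset_v2025-11.csv")

run = 0  # one of the split iterations
cluster_col = f"iteration_{run} cluster"

# Illustrative split: hold out one cluster as the OOD test pool.
held_out = df[cluster_col].max()
test_pool = df[df[cluster_col] == held_out]
train = df[df[cluster_col] != held_out]

# Strict substrate-level OOD: drop test reactions whose aryl or amine
# substrate also appears anywhere in the training set.
test = test_pool[
    ~test_pool["Aryl SMILES"].isin(train["Aryl SMILES"])
    & ~test_pool["Amine SMILES"].isin(train["Amine SMILES"])
]
```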
This table merges:
- Newly generated JnJ BH HTE data
- Curated open-source BH / Pd-catalyzed C–N HTE datasets (see Dataset references below)
- Critically for generalization across sources, every chemical entity was curated with an internal tool so that it has a single, consistent SMILES and name representation.
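The internal curation tool is not included in this repository, but the core idea can be sketched with RDKit canonicalization, which collapses different SMILES spellings of one compound into a single form (a sketch only, not the tool used to build the dataset):

```python
from rdkit import Chem

def canonical_smiles(smiles: str) -> str:
    """Return RDKit's canonical SMILES, collapsing equivalent spellings."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    return Chem.MolToSmiles(mol)

# Two spellings of toluene map to the same canonical form.
assert canonical_smiles("Cc1ccccc1") == canonical_smiles("c1ccccc1C")
```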
File: data/dataset_summary.csv
Used in the CRDS Notebook via `load_dataset_summary`.
Format (per row):
`Unique Reactant Pairs`, `Unique Reagent Sets`, `Reaction Samples`, `Dataset Name`
Each row corresponds to one training data source included in the unified BH HTE dataset.
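For a quick look, the file can be read directly with pandas, assuming the header names match the per-row format above (the notebook's `load_dataset_summary` helper may apply additional processing):

```python
import pandas as pd

# One row per training data source in the unified BH HTE dataset.
summary = pd.read_csv("data/dataset_summary.csv")
print(summary[["Dataset Name", "Unique Reactant Pairs",
               "Unique Reagent Sets", "Reaction Samples"]])
```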
These files are produced by running the notebooks and constitute the results tables shown in the S.I.:
- `results/Data_Source_{features}_all_{year}-{month}-{day}.csv` — all model / train-source combinations, aggregated over random splits.
- `results/Data_Source_{features}_best_{year}-{month}-{day}.csv` — one row per training data source: the best model according to `ROC AUC Avg + F1 Score Avg`.
- `results/JnJ25_{True/False}_Train_{features}_all_{year}-{month}-{day}.pkl` — raw per-run performance for the JnJ-inclusion experiments.
- `results/JnJ25_{True/False}_Train_{features}_best_{year}-{month}-{day}.csv` — aggregated, best-model summaries for the JnJ inclusion/exclusion experiments.
These correspond to the tables and curves in the manuscript (e.g., per-source OOD performance and JnJ impact analysis).
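Because the output names are date-stamped, downstream steps need the most recent file; one simple way to locate it (a sketch assuming DRFP features):

```python
import glob
import os

# Pick the most recently written best-model results file.
candidates = glob.glob("results/Data_Source_DRFP_best_*.csv")
latest = max(candidates, key=os.path.getmtime)
print("Using:", latest)
```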
CRDS_Notebook.ipynb
Goal:
Quantify how the Compound–Reaction Diversity Score (CRDS) correlates with out-of-distribution (OOD) performance and compare this to raw dataset size.
Key functions:
- `calculate_crds(...)` — computes CRDS from tuples `[unique_pairs, unique_reagent_combinations, total_reactions, name]`, with tunable exponents and thresholds.
- `summarize_sources_with_combined(df)` — aggregates the unified dataset by `Source`, plus combined rows for “All except JNJ HTE 2024” and “All sources”.
- `calculate_correlations(df, "CRDS", [...])` — computes Pearson and Spearman correlations between CRDS (or dataset size) and `ROC AUC Avg`, `Balanced Accuracy Avg`, `F1 Score Avg`, and `AU-PR-C Avg`.
Inputs:
- `data/dataset_summary.csv`
- `results/Data_Source_best.csv` (generated by the Data Sources Benchmark Notebook)
Outputs:
- Correlation tables for:
  - CRDS vs OOD metrics
  - Simplified CRDS vs OOD metrics
  - Dataset size vs OOD metrics
This underpins the result that diversity (via CRDS) correlates more strongly with OOD performance than raw dataset size.
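A minimal sketch of such a correlation computation with scipy, assuming a per-source table that already combines the CRDS values with the metric columns named above (the notebook's `calculate_correlations` may differ in its details):

```python
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# Hypothetical merged table: one row per training data source.
df = pd.DataFrame({
    "CRDS":                  [0.2, 0.5, 0.7, 0.9],
    "ROC AUC Avg":           [0.62, 0.71, 0.78, 0.85],
    "Balanced Accuracy Avg": [0.58, 0.66, 0.70, 0.79],
})

for metric in ["ROC AUC Avg", "Balanced Accuracy Avg"]:
    r, p_r = pearsonr(df["CRDS"], df[metric])
    rho, p_rho = spearmanr(df["CRDS"], df[metric])
    print(f"{metric}: Pearson r={r:.2f} (p={p_r:.3f}), "
          f"Spearman rho={rho:.2f} (p={p_rho:.3f})")
```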
Data_Sources_Benchmark_Notebook.ipynb
Goal:
Benchmark multiple ML models trained on one data source at a time and tested on OOD clusters built from the remaining data.
High-level workflow:
1. Select the feature representation

   ```python
   # Either "DRFP" or "QM"
   features = "DRFP"
   ```

2. Load the unified dataset

   ```python
   import pickle

   with open("data/HTE_BH_10_iter_9-3_Clusters_v2025-11.pkl", "rb") as f:
       df_HTE_BH = pickle.load(f)
   ```

3. Define the model panel

   ```python
   model_names = [
       "rf_model",
       "gradient_boosting_model",
       "lr_model",
       "mlp_model",
       "knn_model",
       "gaussianNB_model",
   ]
   ```

   These models are implemented in `utils/ml_train.py`.

4. Train / evaluate across random OOD splits

   For each of 10 runs (different `random_state`):
   - For each `main_source` in `df_HTE_BH.Source.unique()`:
     - `df_train`: all reactions from that source.
     - Use `iteration_{run} cluster` to define an OOD test set (“outer clusters”).
     - Enforce strict OOD by filtering out any test reactions whose aryl or amine appears in `df_train`.
     - Train models via `train_models(...)`.
     - Evaluate via `calculate_metrics(...)`, storing per-run metrics in `df_perf`.

5. Aggregate metrics
   - For each `(model, train_data)` combination, compute the mean and std over runs for: Balanced Accuracy, ROC AUC, F1 Score, F0 Score, AU-PR-C, Precision 1, Recall 1.
   - Select the best model per training data source based on `ROC AUC Avg + F1 Score Avg` (see the sketch after this list).

6. Save results
   - Full panel: `results/Data_Source_{features}_all_{year}-{month}-{day}.csv`
   - Best per source: `results/Data_Source_{features}_best_{year}-{month}-{day}.csv`
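A minimal sketch of the aggregation and best-model selection (steps 5–6), run here on a dummy `df_perf` with two of the metrics; the notebook builds the real per-run table in step 4:

```python
import numpy as np
import pandas as pd

# Dummy stand-in for df_perf, the per-run metrics table from step 4.
rng = np.random.default_rng(0)
df_perf = pd.DataFrame({
    "model": np.repeat(["rf_model", "lr_model"], 10),
    "train_data": "Example source",
    "ROC AUC": rng.uniform(0.6, 0.9, 20),
    "F1 Score": rng.uniform(0.4, 0.8, 20),
})

# Step 5: mean and std over the 10 runs per (model, train source).
agg = df_perf.groupby(["model", "train_data"])[["ROC AUC", "F1 Score"]].agg(["mean", "std"])
agg.columns = [f"{metric} {'Avg' if stat == 'mean' else 'Std'}"
               for metric, stat in agg.columns]
agg = agg.reset_index()

# Best model per training source: maximize ROC AUC Avg + F1 Score Avg.
score = agg["ROC AUC Avg"] + agg["F1 Score Avg"]
best = agg.loc[score.groupby(agg["train_data"]).idxmax()]
print(best)
```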
These outputs are used directly by the CRDS Notebook and correspond to the per-source OOD performance reported in the manuscript.
JnJ25_Data_Impact_Notebook.ipynb
Goal:
Quantify the impact of including the new JnJ industrial HTE dataset in training on OOD performance and calibration, including confidence-bucketed enrichment curves.
Key parameters:

```python
# Either "DRFP" or "QM"
features = "DRFP"
# Exclude JnJ HTE from training
no_jjhte_train = False
```

High-level workflow:
1. Load the unified dataset.
2. For each of 10 random seeds:
   - Call `train_test_ood_cluster_static(...)` (from the utilities) to:
     - Build OOD splits with and without JnJ data in training.
     - Train the model panel.
     - Collect global metrics and per-confidence-bucket statistics (see the sketch after this list).
3. Aggregate across runs using:
   - `calculate_model_performance(...)`
   - `combine_per_bucket_results(...)`
4. Save:
   - `results/JnJ25_{not(no_jjhte_train)}_Train_{features}_all_{year}-{month}-{day}.pkl` (raw results for all runs)
   - `results/JnJ25_{not(no_jjhte_train)}_Train_{features}_best_{year}-{month}-{day}.csv` (aggregated best-model metrics)
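The per-confidence-bucket statistics underpin the enrichment curves. A minimal sketch of the idea on toy data, assuming a binary success label and model-predicted success probabilities (the repository's bucketing in `train_test_ood_cluster_static` / `combine_per_bucket_results` may differ):

```python
import numpy as np
import pandas as pd

# Toy predicted success probabilities and matching binary labels.
rng = np.random.default_rng(0)
proba = rng.uniform(0, 1, 1000)
y_true = (rng.uniform(0, 1, 1000) < proba).astype(int)  # calibrated toy labels

# Bucket predictions by model confidence and compute the observed success
# rate per bucket; an enrichment curve plots this rate per bucket.
buckets = pd.cut(proba, bins=np.linspace(0, 1, 11))
per_bucket = (
    pd.DataFrame({"bucket": buckets, "success": y_true})
    .groupby("bucket", observed=True)["success"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "observed success rate", "count": "n"})
)
print(per_bucket)
```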
These results correspond to the “with vs without JnJ” comparisons and the discussion of how the new dataset boosts OOD performance.
1. Benchmark per-source OOD performance
   - Open `Data_Sources_Benchmark_Notebook.ipynb`
   - Set `features = "DRFP"` (or `"QM"` if applicable)
   - Run all cells → generates `results/Data_Source_{features}_all_*.csv` and `results/Data_Source_{features}_best_*.csv`
2. CRDS vs OOD performance
   - Ensure `results/Data_Source_best.csv` points to the latest `Data_Source_{features}_best_*.csv`
   - Open `CRDS_Notebook.ipynb`
   - Run all cells → prints correlation tables (CRDS / simplified CRDS / size vs OOD metrics)
3. Impact of JnJ dataset
   - Open `JnJ25_Data_Impact_Notebook.ipynb`
   - Choose `features` and `no_jjhte_train`
   - Run all cells → produces the `results/JnJ25_*` files used for the JnJ inclusion/exclusion analysis
If you use this code or the curated BH dataset, please cite:
Neves, P.; Hao, B.; Aikonen, S.; Diccianni, J. B.; Wegner, J. K.; Schwaller, P.; Strambeanu, I. I.
Robust Out-of-Distribution Prediction of Buchwald–Hartwig Reactions.
ChemRxiv (2025), DOI: 10.26434/chemrxiv-2025-xcr46.
The curated BH dataset integrates and standardizes multiple open-source HTE BH / reactivity datasets, together with newly generated industrial HTE data. Please also cite the original data sources where appropriate.
- Santanilla, A. B.; et al. Nanomole-Scale High-Throughput Chemistry for the Synthesis of Complex Molecules. Science 2015, 347, 49–53.
- Ahneman, D. T.; Estrada, J. G.; Lin, S.; Dreher, S. D.; Doyle, A. G. Predicting Reaction Performance in C–N Cross-Coupling Using Machine Learning. Science 2018, 360, 186–190. DOI: 10.1126/science.aar5169.
- Rinehart, N. I.; Saunthwal, R. K.; Wellauer, J.; Zahrt, A. F.; Schlemper, L.; Shved, A. S.; Bigler, R.; Fantasia, S.; Denmark, S. E. A Machine-Learning Tool to Predict Substrate-Adaptive Conditions for Pd-Catalyzed C–N Couplings. Science 2023, 381, 965–972.
- Fitzner, M.; Wuitschik, G.; Koller, R. J.; Adam, J.-M.; Schindler, T.; Reymond, J.-L. What Can Reaction Databases Teach Us about Buchwald–Hartwig Cross-Couplings? Chem. Sci. 2020, 11, 13085–13093.
- Saebi, M.; Nan, B.; Herr, J. E.; Bustillo, L.; Wiest, O.; Shields, B. J.; et al. On the Use of Real-World Datasets for Reaction Yield Prediction. Chem. Sci. 2023, 14, 1671–1685.
- King-Smith, E.; et al. Probing the Chemical “Reactome” with High-Throughput Experimentation Data. Nat. Chem. 2024, 16, 633–643.
- Ha, S. K.; et al. Developing Pharmaceutically Relevant Pd-Catalyzed C–N Coupling Reactivity Models Leveraging High-Throughput Experimentation. J. Am. Chem. Soc. 2025, 147, 19602–19613.