Code & Curated Dataset
This repository contains the code and curated data accompanying:
Neves, P.; Hao, B.; Aikonen, S. et al. “Robust Out-of-Distribution Prediction of Buchwald–Hartwig Reactions” (2025), ChemRxiv, DOI: 10.26434/chemrxiv-2025-xcr46
It provides:
- A unified, curated Buchwald–Hartwig (BH) high-throughput experimentation (HTE) dataset (JnJ dedicated production + open-source)
- Notebooks to reproduce results:
- Out-of-distribution (OOD) benchmarking across data sources
- The Compound–Reaction Diversity Score (CRDS) vs OOD performance
- The impact of the newly produced JnJ HTE dataset on model performance when added to existing data
The code was developed and tested with:
- Python 3.8.13
- scikit-learn 1.1.2
To reproduce the results, we recommend using the same Python and scikit-learn versions. The RDKit version used was 2021.03.3, but the DRFP and PhysChem features are already included in the curated dataset, so RDKit is not required to run the notebooks.
```bash
conda create -n bh_ood python=3.8
conda activate bh_ood
pip install "scikit-learn==1.1.2" numpy pandas scipy tqdm matplotlib
```
Installation typically takes under 5 minutes via pip, and the single demo benchmark runs in approximately 10 minutes on a standard desktop computer.
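To confirm the environment matches the tested versions before running the notebooks, a quick check:

```python
# Quick sanity check that the installed versions match the tested ones.
import sklearn
import numpy
import pandas

print("scikit-learn:", sklearn.__version__)  # expected: 1.1.2
print("numpy:", numpy.__version__)
print("pandas:", pandas.__version__)
```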
File: data/BH_HTE_Curated_Dataset_v2025-11.csv
Pickle version: contains additional columns; due to its size, it is hosted at https://zenodo.org/records/17634928
The .csv, loaded as a pandas.DataFrame, contains:
- Reaction identity
  - Aryl and amine SMILES (e.g. `Aryl SMILES`, `Amine SMILES`)
  - Reagent SMILES (e.g. `Catalyst SMILES`, `Solvent SMILES`, `Base SMILES`)
  - Yield / response information and a binary success label as used in the manuscript
- Metadata
  - `Source`: origin of each reaction (e.g. literature dataset name or internal JnJ campaign)
- Features / clustering
  - Columns for features (e.g. DRFP / PhysChem-based, depending on preprocessing)
  - Cluster assignments: `iteration_0 cluster`, `iteration_1 cluster`, …

These are used to build in-domain vs. out-of-distribution (OOD) train/test splits. Because reagents are included in the fingerprints, the cluster-based definition of OOD does not by itself guarantee substrate-level OOD; for the data source benchmark, substrate novelty is therefore enforced in the code.
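As an illustration of the strict-OOD filtering, here is a minimal sketch against the .csv. The choice of which cluster serves as the test pool is hypothetical here; the notebooks implement the actual split logic:

```python
import pandas as pd

df = pd.read_csv("data/BH_HTE_Curated_Dataset_v2025-11.csv")

run = 0  # one of the split iterations
cluster_col = f"iteration_{run} cluster"

# Illustrative split: hold out one cluster as the OOD test pool.
held_out = df[cluster_col].max()
test_pool = df[df[cluster_col] == held_out]
train = df[df[cluster_col] != held_out]

# Strict substrate-level OOD: drop test reactions whose aryl or amine
# substrate also appears anywhere in the training set.
test = test_pool[
    ~test_pool["Aryl SMILES"].isin(train["Aryl SMILES"])
    & ~test_pool["Amine SMILES"].isin(train["Amine SMILES"])
]
```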
This table merges:
- Newly generated JnJ BH HTE data
- Curated open-source BH / Pd-catalyzed C–N HTE datasets (see Dataset references below)
- Critically for generalization across sources, every chemical entity was curated with an internal tool so that it has a single, consistent SMILES and name representation.
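The internal curation tool is not included in this repository, but the core idea can be sketched with RDKit canonicalization, which collapses different SMILES spellings of one compound into a single form (a sketch only, not the tool used to build the dataset):

```python
from rdkit import Chem

def canonical_smiles(smiles: str) -> str:
    """Return RDKit's canonical SMILES, collapsing equivalent spellings."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    return Chem.MolToSmiles(mol)

# Two spellings of toluene map to the same canonical form.
assert canonical_smiles("Cc1ccccc1") == canonical_smiles("c1ccccc1C")
```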
File: data/dataset_summary.csv
Used in the CRDS Notebook via `load_dataset_summary`.
Format (per row):
`Unique Reactant Pairs`, `Unique Reagent Sets`, `Reaction Samples`, `Dataset Name`
Each row corresponds to one training data source included in the unified BH HTE dataset.
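For a quick look, the file can be read directly with pandas, assuming the header names match the per-row format above (the notebook's `load_dataset_summary` helper may apply additional processing):

```python
import pandas as pd

# One row per training data source in the unified BH HTE dataset.
summary = pd.read_csv("data/dataset_summary.csv")
print(summary[["Dataset Name", "Unique Reactant Pairs",
               "Unique Reagent Sets", "Reaction Samples"]])
```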
These files are produced by running the notebooks and constitute the results tables shown in the S.I.:
- `results/Data_Source_{features}_all_{year}-{month}-{day}.csv` — all model / train-source combinations, aggregated over random splits.
- `results/Data_Source_{features}_best_{year}-{month}-{day}.csv` — one row per training data source: the best model according to `ROC AUC Avg + F1 Score Avg`.
- `results/JnJ25_{True/False}_Train_{features}_all_{year}-{month}-{day}.pkl` — raw per-run performance for the JnJ-inclusion experiments.
- `results/JnJ25_{True/False}_Train_{features}_best_{year}-{month}-{day}.csv` — aggregated, best-model summaries for the JnJ inclusion/exclusion experiments.
These correspond to the tables and curves in the manuscript (e.g., per-source OOD performance and JnJ impact analysis).
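Because the output names are date-stamped, downstream steps need the most recent file; one simple way to locate it (a sketch assuming DRFP features):

```python
import glob
import os

# Pick the most recently written best-model results file.
candidates = glob.glob("results/Data_Source_DRFP_best_*.csv")
latest = max(candidates, key=os.path.getmtime)
print("Using:", latest)
```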
CRDS_Notebook.ipynb
Goal:
Quantify how the Compound–Reaction Diversity Score (CRDS) correlates with out-of-distribution (OOD) performance and compare this to raw dataset size.
Key functions:
- `calculate_crds(...)` — computes CRDS from tuples `[unique_pairs, unique_reagent_combinations, total_reactions, name]`, with tunable exponents and thresholds.
- `summarize_sources_with_combined(df)` — aggregates the unified dataset by `Source`, plus combined rows for “All except JNJ HTE 2024” and “All sources”.
- `calculate_correlations(df, "CRDS", [...])` — computes Pearson and Spearman correlations between CRDS (or dataset size) and `ROC AUC Avg`, `Balanced Accuracy Avg`, `F1 Score Avg`, and `AU-PR-C Avg`.
Inputs:
- `data/dataset_summary.csv`
- `results/Data_Source_best.csv` (generated by the Data Sources Benchmark Notebook)
Outputs:
- Correlation tables for:
  - CRDS vs OOD metrics
  - Simplified CRDS vs OOD metrics
  - Dataset size vs OOD metrics
This underpins the result that diversity (via CRDS) correlates more strongly with OOD performance than raw dataset size.
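A minimal sketch of such a correlation computation with scipy, assuming a per-source table that already combines the CRDS values with the metric columns named above (the notebook's `calculate_correlations` may differ in its details):

```python
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# Hypothetical merged table: one row per training data source.
df = pd.DataFrame({
    "CRDS":                  [0.2, 0.5, 0.7, 0.9],
    "ROC AUC Avg":           [0.62, 0.71, 0.78, 0.85],
    "Balanced Accuracy Avg": [0.58, 0.66, 0.70, 0.79],
})

for metric in ["ROC AUC Avg", "Balanced Accuracy Avg"]:
    r, p_r = pearsonr(df["CRDS"], df[metric])
    rho, p_rho = spearmanr(df["CRDS"], df[metric])
    print(f"{metric}: Pearson r={r:.2f} (p={p_r:.3f}), "
          f"Spearman rho={rho:.2f} (p={p_rho:.3f})")
```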
Data_Sources_Benchmark_Notebook.ipynb
Goal:
Benchmark multiple ML models trained on one data source at a time and tested on OOD clusters built from the remaining data.
High-level workflow:
1. Select the feature representation

   ```python
   # Either "DRFP" or "QM"
   features = "DRFP"
   ```

2. Load the unified dataset

   ```python
   import pickle

   with open("data/HTE_BH_10_iter_9-3_Clusters_v2025-11.pkl", "rb") as f:
       df_HTE_BH = pickle.load(f)
   ```

3. Define the model panel

   ```python
   model_names = [
       "rf_model",
       "gradient_boosting_model",
       "lr_model",
       "mlp_model",
       "knn_model",
       "gaussianNB_model",
   ]
   ```

   These models are implemented in `utils/ml_train.py`.

4. Train / evaluate across random OOD splits

   For each of 10 runs (different `random_state`):
   - For each `main_source` in `df_HTE_BH.Source.unique()`:
     - `df_train`: all reactions from that source.
     - Use `iteration_{run} cluster` to define an OOD test set (“outer clusters”).
     - Enforce strict OOD by filtering out any test reactions whose aryl or amine appears in `df_train`.
     - Train models via `train_models(...)`.
     - Evaluate via `calculate_metrics(...)`, storing per-run metrics in `df_perf`.

5. Aggregate metrics
   - For each `(model, train_data)` combination, compute the mean and std over runs for: Balanced Accuracy, ROC AUC, F1 Score, F0 Score, AU-PR-C, Precision 1, Recall 1.
   - Select the best model per training data source based on `ROC AUC Avg + F1 Score Avg` (see the sketch after this list).

6. Save results
   - Full panel: `results/Data_Source_{features}_all_{year}-{month}-{day}.csv`
   - Best per source: `results/Data_Source_{features}_best_{year}-{month}-{day}.csv`
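A minimal sketch of the aggregation and best-model selection (steps 5–6), run here on a dummy `df_perf` with two of the metrics; the notebook builds the real per-run table in step 4:

```python
import numpy as np
import pandas as pd

# Dummy stand-in for df_perf, the per-run metrics table from step 4.
rng = np.random.default_rng(0)
df_perf = pd.DataFrame({
    "model": np.repeat(["rf_model", "lr_model"], 10),
    "train_data": "Example source",
    "ROC AUC": rng.uniform(0.6, 0.9, 20),
    "F1 Score": rng.uniform(0.4, 0.8, 20),
})

# Step 5: mean and std over the 10 runs per (model, train source).
agg = df_perf.groupby(["model", "train_data"])[["ROC AUC", "F1 Score"]].agg(["mean", "std"])
agg.columns = [f"{metric} {'Avg' if stat == 'mean' else 'Std'}"
               for metric, stat in agg.columns]
agg = agg.reset_index()

# Best model per training source: maximize ROC AUC Avg + F1 Score Avg.
score = agg["ROC AUC Avg"] + agg["F1 Score Avg"]
best = agg.loc[score.groupby(agg["train_data"]).idxmax()]
print(best)
```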
These outputs are used directly by the CRDS Notebook and correspond to the per-source OOD performance reported in the manuscript.
JnJ25_Data_Impact_Notebook.ipynb
Goal:
Quantify the impact of including the new JnJ industrial HTE dataset in training on OOD performance and calibration, including confidence-bucketed enrichment curves.
Key parameters:

```python
# Either "DRFP" or "QM"
features = "DRFP"
# Exclude JnJ HTE from training
no_jjhte_train = False
```

High-level workflow:
1. Load the unified dataset.
2. For each of 10 random seeds:
   - Call `train_test_ood_cluster_static(...)` (from the utilities) to:
     - Build OOD splits with and without JnJ data in training.
     - Train the model panel.
     - Collect global metrics and per-confidence-bucket statistics (see the sketch after this list).
3. Aggregate across runs using:
   - `calculate_model_performance(...)`
   - `combine_per_bucket_results(...)`
4. Save:
   - `results/JnJ25_{not(no_jjhte_train)}_Train_{features}_all_{year}-{month}-{day}.pkl` (raw results for all runs)
   - `results/JnJ25_{not(no_jjhte_train)}_Train_{features}_best_{year}-{month}-{day}.csv` (aggregated best-model metrics)
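The per-confidence-bucket statistics underpin the enrichment curves. A minimal sketch of the idea on toy data, assuming a binary success label and model-predicted success probabilities (the repository's bucketing in `train_test_ood_cluster_static` / `combine_per_bucket_results` may differ):

```python
import numpy as np
import pandas as pd

# Toy predicted success probabilities and matching binary labels.
rng = np.random.default_rng(0)
proba = rng.uniform(0, 1, 1000)
y_true = (rng.uniform(0, 1, 1000) < proba).astype(int)  # calibrated toy labels

# Bucket predictions by model confidence and compute the observed success
# rate per bucket; an enrichment curve plots this rate per bucket.
buckets = pd.cut(proba, bins=np.linspace(0, 1, 11))
per_bucket = (
    pd.DataFrame({"bucket": buckets, "success": y_true})
    .groupby("bucket", observed=True)["success"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "observed success rate", "count": "n"})
)
print(per_bucket)
```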
These results correspond to the “with vs without JnJ” comparisons and the discussion of how the new dataset boosts OOD performance.
1. Benchmark per-source OOD performance
   - Open `Data_Sources_Benchmark_Notebook.ipynb`
   - Set `features = "DRFP"` (or `"QM"` if applicable)
   - Run all cells → generates `results/Data_Source_{features}_all_*.csv` and `results/Data_Source_{features}_best_*.csv`
2. CRDS vs OOD performance
   - Ensure `results/Data_Source_best.csv` points to the latest `Data_Source_{features}_best_*.csv`
   - Open `CRDS_Notebook.ipynb`
   - Run all cells → prints correlation tables (CRDS / simplified CRDS / size vs OOD metrics)
3. Impact of JnJ dataset
   - Open `JnJ25_Data_Impact_Notebook.ipynb`
   - Choose `features` and `no_jjhte_train`
   - Run all cells → produces the `results/JnJ25_*` files used for the JnJ inclusion/exclusion analysis
If you use this code or the curated BH dataset, please cite:
Neves, P.; Hao, B.; Aikonen, S.; Diccianni, J. B.; Wegner, J. K.; Schwaller, P.; Strambeanu, I. I.
Robust Out-of-Distribution Prediction of Buchwald–Hartwig Reactions.
ChemRxiv (2025), DOI: 10.26434/chemrxiv-2025-xcr46.
The curated BH dataset integrates and standardizes multiple open-source HTE BH / reactivity datasets, together with newly generated industrial HTE data. Please also cite the original data sources where appropriate.
- Santanilla, A. B.; et al. Nanomole-Scale High-Throughput Chemistry for the Synthesis of Complex Molecules. Science 2015, 347, 49–53.
- Ahneman, D. T.; Estrada, J. G.; Lin, S.; Dreher, S. D.; Doyle, A. G. Predicting Reaction Performance in C–N Cross-Coupling Using Machine Learning. Science 2018, 360, 186–190. DOI: 10.1126/science.aar5169.
- Rinehart, N. I.; Saunthwal, R. K.; Wellauer, J.; Zahrt, A. F.; Schlemper, L.; Shved, A. S.; Bigler, R.; Fantasia, S.; Denmark, S. E. A Machine-Learning Tool to Predict Substrate-Adaptive Conditions for Pd-Catalyzed C–N Couplings. Science 2023, 381, 965–972.
- Fitzner, M.; Wuitschik, G.; Koller, R. J.; Adam, J.-M.; Schindler, T.; Reymond, J.-L. What Can Reaction Databases Teach Us about Buchwald–Hartwig Cross-Couplings? Chem. Sci. 2020, 11, 13085–13093.
- Saebi, M.; Nan, B.; Herr, J. E.; Bustillo, L.; Wiest, O.; Shields, B. J.; et al. On the Use of Real-World Datasets for Reaction Yield Prediction. Chem. Sci. 2023, 14, 1671–1685.
- King-Smith, E.; et al. Probing the Chemical “Reactome” with High-Throughput Experimentation Data. Nat. Chem. 2024, 16, 633–643.
- Ha, S. K.; et al. Developing Pharmaceutically Relevant Pd-Catalyzed C–N Coupling Reactivity Models Leveraging High-Throughput Experimentation. J. Am. Chem. Soc. 2025, 147, 19602–19613.