This repository contains a multi-stage pipeline for automatically generating, filtering, and refining Python functions that encode specific retrosynthesis strategies. The ultimate goal is to create a high-quality, structured library of these functions for use in route annotation, analysis, and steerable retrosynthesis models.
The workflow is orchestrated through a series of scripts located in the scripts/ directory.
- Workflow Overview
- Prerequisites
- Installation
- CLI Usage
- Step-by-Step Guide
- Advanced Usage: SLURM Job Submission
- Local Execution (Non-SLURM)
- Directory Structure
The entire process is designed to be run sequentially, where the output of one step becomes the input for the next.
- Generate Functions (01_run_function_generation.sh): An LLM is prompted with chemical reaction data to generate raw Python functions that attempt to identify synthetic strategies.
- Extract Functions (02_extract_functions_from_json.py): The raw JSON output from the generation step is parsed, and each valid function is saved into its own .py file.
- Filter & Refine (03_run_filtering_pipeline.py): This is the core multi-stage pipeline that uses LLMs to iteratively improve the quality of the generated functions:
  - Stage 1 (Flash Filter): A fast, inexpensive LLM performs an initial quality check, removing obvious failures.
  - Stage 2 (Pro Filter): A more powerful LLM performs a deeper analysis, fixing minor bugs and improving descriptions.
  - Stage 3 (Metadata Generation): The logic of each passing function is deconstructed into a structured JSON format.
  - Stage 4 (Return Modification): The function's code is automatically refactored to return not just a boolean, but the structured metadata of what it found.
- Annotate Routes (synth-strategy annotate): Use the CLI to apply the function library to synthesis routes, identifying which strategies are present in each route.
- Cluster Strategies (synth-strategy cluster): Perform clustering analysis to discover common strategy patterns and their relationships.
- Visualize Results (synth-strategy visualize): Generate visualizations to understand strategy distributions and clustering results.
Note: The CLI (synth-strategy) is the recommended way to use the function library for annotation, clustering, and visualization. For large-scale HPC processing, SLURM scripts are also available (see Advanced Usage).
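To make Stage 4 (Return Modification) concrete, here is a minimal sketch of what a refactored function's return value might look like. The function name, the route fields, and the metadata keys are hypothetical and chosen only for illustration; the generated functions and their metadata schema may differ.

```python
from typing import Any, Dict, Tuple

# Hypothetical stand-in for a reaction-level check; a real function would
# inspect mapped reaction SMILES with a cheminformatics toolkit.
def _is_amide_coupling(reaction: Dict[str, Any]) -> bool:
    return reaction.get("name") == "amide_formation"

def late_stage_amide_coupling(route: Dict[str, Any]) -> Tuple[bool, Dict[str, Any]]:
    """Return (matched, findings) instead of a bare boolean."""
    findings: Dict[str, Any] = {"named_reactions": []}
    # Assumption: route["reactions"] is ordered from the final step backwards,
    # so small indices correspond to late-stage steps.
    for depth, reaction in enumerate(route.get("reactions", [])):
        if depth <= 1 and _is_amide_coupling(reaction):
            findings["named_reactions"].append(reaction["name"])
    return bool(findings["named_reactions"]), findings

example_route = {"reactions": [{"name": "amide_formation"}, {"name": "ring_formation"}]}
matched, findings = late_stage_amide_coupling(example_route)
print(matched, findings)
```

Returning the findings alongside the boolean is what lets later steps filter on exactly which reactions or ring systems triggered a match.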
- Python 3.9+
- Conda or another virtual environment manager.
- An API key for an LLM provider (e.g., OpenRouter), set as an environment variable (see Installation).
git clone https://github.com/yourusername/synth-strategy-pipeline.git
cd synth-strategy-pipeline
Create and activate a conda environment:
conda create --name synth_strategy_env python=3.10
conda activate synth_strategy_env
conda install -c conda-forge rdkit
Install the package in editable mode:
pip install --upgrade -e .
The pipeline requires API keys for LLM providers. Set these as environment variables:
# OpenRouter API key (required for function generation and filtering)
export OPENROUTER_API_KEY="your_openrouter_api_key_here"
To make these permanent, add them to your shell configuration file (~/.bashrc, ~/.zshrc, etc.):
echo 'export OPENROUTER_API_KEY="your_openrouter_api_key_here"' >> ~/.bashrc
source ~/.bashrc
To get started quickly with pre-annotated USPTO route data, run the download script:
python scripts/download_uspto_data.py
This will:
- Download 4 annotated route files from Figshare (~2.5GB total)
- Extract them to data/uspto_st/
- Verify file integrity
The downloaded data includes thousands of annotated synthesis routes ready for retrieval and clustering analysis.
Manual Download: If you prefer to download manually, the data is available at: https://figshare.com/account/articles/30146374
The synth-strategy CLI provides a unified interface for route annotation, clustering, retrieval, and visualization. After installation, the CLI is available globally as synth-strategy.
- annotate: Annotate routes with strategy functions
- cluster: Perform strategy clustering analysis
- retrieve: Search for strategies using natural language queries
- visualize: Generate visualizations from results
Annotate synthesis routes by applying strategy functions to identify which strategies are present in each route.
Basic Usage:
# Annotate routes from a directory
synth-strategy annotate \
--input-dir data/routes/ \
--functions-dir data/strategy_function_library/ \
--output results/
# Annotate routes from a single file
synth-strategy annotate \
--input-file data/test/test_routes.json \
--functions-dir data/strategy_function_library/ \
--output results/
Advanced Usage:
# Annotate and cluster in one step
synth-strategy annotate \
--input-dir data/routes/ \
--functions-dir data/strategy_function_library/ \
--output results/ \
--cluster \
--visualize \
--interactive
# Use multiple workers for faster processing
synth-strategy annotate \
--input-file data/routes.json \
--functions-dir data/strategy_function_library/ \
--workers 8 \
--output results/
Options:
- --input-dir: Directory containing route JSON files
- --input-file: Single route JSON file
- --functions-dir: Directory containing strategy function Python files
- --output/-o: Output directory for results
- --workers: Number of parallel workers (default: CPU count)
- --cluster: Run clustering after annotation
- --visualize: Generate visualizations (requires --cluster)
- --interactive: Create interactive visualizations
- --config/-c: Configuration file path
Output:
Creates annotated_routes.json with routes annotated with:
- passing_functions: Strategy functions that matched the route
- errored_functions: Functions that encountered errors
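As a quick way to inspect annotation output, the sketch below tallies how often each strategy function matched. It assumes each route stores passing_functions as a list of function names; if the real file stores richer objects, the counting step would need adjusting. The sample data and function names are made up.

```python
import json
from collections import Counter

# Hypothetical sample mirroring annotated_routes.json; in practice you would
# load the real file with json.load(open("results/annotated_routes.json")).
sample = json.loads("""[
  {"smiles": "CC(=O)O", "passing_functions": ["fn_a", "fn_b"], "errored_functions": []},
  {"smiles": "CCO", "passing_functions": ["fn_a"], "errored_functions": ["fn_c"]}
]""")

def count_passing(routes):
    """Count, per function name, the number of routes it matched."""
    counts = Counter()
    for route in routes:
        counts.update(route.get("passing_functions", []))
    return counts

counts = count_passing(sample)
for name, n in counts.most_common():
    print(f"{name}: {n}")
```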
Perform clustering analysis on annotated routes to identify common strategy patterns.
Basic Usage:
# Cluster annotated routes
synth-strategy cluster \
--input-file results/annotated_routes.json \
--functions-dir data/strategy_function_library/ \
--output results/
# Cluster and visualize
synth-strategy cluster \
--input-dir data/annotated_routes/ \
--functions-dir data/strategy_function_library/ \
--output results/ \
--visualize
Options:
- --input-dir: Directory containing annotated route JSON files
- --input-file: Single annotated route JSON file
- --functions-dir: Directory containing strategy function source code (for docstring extraction)
- --output/-o: Output directory for results
- --visualize: Generate visualizations after clustering
Output:
Creates clustering_results.json containing:
- optimal_k: Optimal number of clusters found
- cluster_allocations: Mapping of routes to cluster IDs
- cluster_defining_features: Key strategies that define each cluster
- function_docstrings: Documentation for each strategy function
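A small sketch of how the clustering output could be regrouped for inspection. It assumes cluster_allocations is a flat mapping from route ID to cluster ID; the route IDs below are invented, and the real file may nest this differently.

```python
from collections import defaultdict

# Hypothetical excerpt of clustering_results.json.
results = {
    "optimal_k": 2,
    "cluster_allocations": {"route_0": 0, "route_1": 1, "route_2": 0},
}

def routes_by_cluster(clustering_results):
    """Invert route -> cluster into cluster -> list of routes."""
    groups = defaultdict(list)
    for route_id, cluster_id in clustering_results["cluster_allocations"].items():
        groups[cluster_id].append(route_id)
    return dict(groups)

groups = routes_by_cluster(results)
for cluster_id in sorted(groups):
    print(cluster_id, groups[cluster_id])
```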
Search for synthesis strategies using natural language queries or structured queries.
Basic Usage:
# Simple text query
synth-strategy retrieve \
--query "oxidation strategy for alcohol to aldehyde" \
--metadata-db data/metadata.db \
--route-db data/routes/ \
--top-k 10 \
--output results/
# Complex query from JSON file
synth-strategy retrieve \
--query-file queries/complex_query.json \
--metadata-db data/metadata.db \
--route-db data/routes/ \
--top-k 20 \
--output results/ \
--visualize
Options:
- --query: Natural language text query
- --query-file: JSON file containing structured query
- --metadata-db: Path to metadata database
- --route-db: Path to route database directory
- --embedding-cache: Path to embedding cache (optional)
- --top-k: Number of results to return
- --output/-o: Output directory
- --visualize: Generate visualizations
Output:
Creates retrieval_results.json with ranked strategy matches.
The retrieval system finds synthesis routes matching your query in two stages:
Stage 1: Query Rewriting (LLM-powered)
Your natural language query is automatically converted to a structured JSON format by an LLM (default: google/gemini-2.5-pro). This structured query contains:
- A natural language description for semantic search
- Exact filters for precise matching using controlled vocabularies
Example query: "late stage amide coupling, preservation of a piperidine core and early stage ring formation"
Gets rewritten to:
{
"operator": "AND",
"queries": [
{
"query": {
"natural_language_description": "Detects late-stage amide coupling strategy",
"filters": {
"OR": {
"named_reactions": [
"Acyl chloride with primary amine to amide",
"Carboxylic acid with primary amine to amide",
"amide_formation"
]
}
}
}
},
{
"query": {
"natural_language_description": "Preservation of piperidine core",
"filters": {"ring_systems": ["piperidine"]}
}
},
{
"query": {
"natural_language_description": "Early-stage ring formation",
"filters": {"named_reactions": ["ring_formation"]}
}
}
]
}
Stage 2: Two-Phase Retrieval
For each sub-query, the system performs:
- Semantic Search (Broad Funnel)
  - Uses embeddings to find functions with similar descriptions
  - Configurable via --top-n-functions (e.g., top 40 most similar)
  - Creates a large candidate pool
- Exact Filtering (Narrowing Down)
  - Performs strict exact-match filtering on atomic checks
  - Uses controlled vocabularies: named_reactions, ring_systems, functional_groups
  - Only keeps functions that exactly match filter criteria
- Route Matching
  - Finds routes containing the filtered functions
  - Checks instance-level matching (specific reaction details)
  - Scores routes by semantic similarity
- Ranking
  - Primary: Number of sub-queries matched
  - Secondary: Average cosine similarity score
  - Returns top-k results
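The funnel and ranking steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the package's implementation: the record layout (name, embedding, checks), the helper names, and the interpretation of "exact match" as "shares at least one value in every filtered category" are all hypothetical, and nested OR filters are not handled.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def candidates(query_vec, functions, filters, top_n):
    """Semantic search (top_n by cosine), then exact filtering."""
    scored = sorted(
        ((cosine(query_vec, f["embedding"]), f) for f in functions),
        key=lambda pair: pair[0],
        reverse=True,
    )[:top_n]
    kept = []
    for score, f in scored:
        # Keep a function only if every filtered category shares a value with it.
        if all(set(vals) & set(f["checks"].get(cat, [])) for cat, vals in filters.items()):
            kept.append((f["name"], score))
    return kept

def rank_routes(route_hits, top_k):
    """route_hits maps route ID -> similarity scores, one per matched sub-query."""
    def sort_key(item):
        _, scores = item
        average = sum(scores) / len(scores) if scores else 0.0
        return (len(scores), average)  # primary: matches, secondary: mean score
    ranked = sorted(route_hits.items(), key=sort_key, reverse=True)
    return [route_id for route_id, _ in ranked[:top_k]]

functions = [
    {"name": "amide_fn", "embedding": [1.0, 0.0], "checks": {"named_reactions": ["amide_formation"]}},
    {"name": "ring_fn", "embedding": [0.0, 1.0], "checks": {"named_reactions": ["ring_formation"]}},
]
kept = candidates([0.9, 0.1], functions, {"named_reactions": ["amide_formation"]}, top_n=2)
order = rank_routes({"r1": [0.9], "r2": [0.6, 0.5], "r3": [0.8, 0.7]}, top_k=2)
print(kept, order)
```

Note that a route matching two sub-queries with moderate scores outranks a route matching one sub-query with a high score, which is the effect of using the match count as the primary key.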
Configuring Retrieval Behavior:
Retrieval behavior can be configured in three ways:
- Configuration File (config.yaml):
retrieval:
  embedding_model: "all-MiniLM-L6-v2"
  top_k: 10
  top_n_functions: 40 # null = use all functions
  metadata_db: "data/function_metadata_database.json"
  route_db_dir: "data/routes"
  embedding_cache: "data/retrieval_embeddings"
- CLI Override:
synth-strategy retrieve \
--query "your query" \
--metadata-db path/to/metadata.json \
--route-db path/to/routes/ \
--embedding-cache path/to/cache.pkl \
--top-k 20
- Python API:
from synth_strategy.retrieval.retriever import StrategyRetriever
from synth_strategy.models import SentenceTransformerEmbedder
embedder = SentenceTransformerEmbedder(model_name="all-MiniLM-L6-v2")
retriever = StrategyRetriever(
    metadata_db_path="data/function_metadata_database.json",
    route_db_dir="data/routes",
    embedding_cache_path="data/retrieval_embeddings/cache.pkl",
    embedder=embedder,
)
# query_dict is a structured query in the JSON format shown under Stage 1
results = retriever.retrieve_complex(query_dict, top_k=10, top_n_functions=40)
Using a New Route Database:
To use a new route database, follow these steps:
- Prepare Your Routes: Routes must be annotated with strategy functions first:
synth-strategy annotate \
--input-dir new_routes/ \
--functions-dir data/strategy_functions/ \
--output annotated_routes/
- Point to New Database:
synth-strategy retrieve \
--query "your query" \
--route-db annotated_routes/ \
--metadata-db data/function_metadata_database.json
- Automatic Processing:
  - The system scans all *.json files in the directory
  - Builds an inverted index (function → routes mapping)
  - Caches the index for faster subsequent loads
  - Auto-invalidates cache when source files change
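The inverted index mentioned above can be pictured with the following sketch. It works on in-memory data rather than scanning a directory, and it assumes passing_functions is a list of function names; the file names, function names, and the (file, position) reference format are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(routes_by_file):
    """Map each function name to the (file, route index) pairs where it passed."""
    index = defaultdict(list)
    for filename, routes in routes_by_file.items():
        for position, route in enumerate(routes):
            for fn_name in route.get("passing_functions", []):
                index[fn_name].append((filename, position))
    return dict(index)

# Hypothetical contents of two annotated route files.
routes_by_file = {
    "batch_a.json": [{"passing_functions": ["fn_a", "fn_b"]}, {"passing_functions": []}],
    "batch_b.json": [{"passing_functions": ["fn_a"]}],
}
index = build_inverted_index(routes_by_file)
print(index["fn_a"])
```

With such an index, finding every route that contains a filtered function is a dictionary lookup instead of a scan over all routes.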
Data Requirements:
The retrieval system requires:
- Metadata Database: function_metadata_database.json containing function descriptions and atomic checks
- Embedding Cache: Pre-computed embeddings for all function descriptions
- Route Database: Directory of annotated route JSON files with a passing_functions field
- Inverted Index: Auto-generated and cached in .retriever_cache/
Performance Optimization:
- Caching: Inverted index and embeddings are cached for fast repeated queries
- Filtering: Only functions that pass at least once in your route database are loaded
- Parallel Processing: Embedding computation can be parallelized
Generate visualizations from clustering or retrieval results.
Usage:
# Visualize clustering results
synth-strategy visualize \
--type clustering \
--input results/clustering_results.json \
--annotated_dir data/annotated_routes/ \
--output visualizations/
# Visualize retrieval results
synth-strategy visualize \
--type retrieval \
--input results/retrieval_results.json \
--output visualizations/
Options:
- --type: Type of visualization: clustering or retrieval (required)
- --input/-i: Input results file (required)
- --output/-o: Output directory
- --annotated_dir: Directory containing annotated routes (required for clustering visualization)
Workflow 1: Quick Analysis
# Annotate, cluster, and visualize in one command
synth-strategy annotate \
--input-file data/routes.json \
--functions-dir data/strategy_function_library/ \
--output results/ \
--cluster \
--visualize \
--interactive
Workflow 2: Step-by-Step Analysis
# Step 1: Annotate routes
synth-strategy annotate \
--input-dir data/routes/ \
--functions-dir data/strategy_function_library/ \
--output results/
# Step 2: Cluster annotated routes
synth-strategy cluster \
--input-file results/annotated_routes.json \
--functions-dir data/strategy_function_library/ \
--output results/
# Step 3: Visualize results
synth-strategy visualize \
--type clustering \
--input results/clustering_results.json \
--annotated_dir data/annotated_routes/ \
--output visualizations/
Workflow 3: Large-Scale Processing
# Process large datasets with parallel workers
synth-strategy annotate \
--input-dir data/large_dataset/ \
--functions-dir data/strategy_function_library/ \
--workers 16 \
--output results/
# Cluster with visualization
synth-strategy cluster \
--input-file results/annotated_routes.json \
--functions-dir data/strategy_function_library/ \
--output results/ \
--visualize
You can use a configuration file (YAML format) to set default parameters:
# config.yaml
defaults:
  functions_dir: "data/strategy_function_library"
  output_dir: "results"
  workers: 8
retrieval:
  metadata_db: "data/metadata.db"
  route_db_dir: "data/routes"
  embedding_cache: "data/embeddings.pkl"
  top_k: 10
Use with any command:
synth-strategy annotate --config config.yaml --input-file data/routes.json
Route JSON Format:
Routes should be in JSON format as a list of route objects:
[
{
"type": "mol",
"smiles": "CC(=O)O",
"metadata": {
"target_smiles": "CC(=O)O"
},
"children": [
{
"type": "reaction",
"metadata": {
"mapped_reaction_smiles": "CCO>>CC(=O)O"
},
"children": [...]
}
]
}
]
Issue: "Not enough routes for clustering"
- Ensure you have at least 2 annotated routes
- Check that routes have passing_functions populated
Issue: "No passing functions found"
- Verify that the strategy functions are compatible with your route format
- Check function error logs in the annotated routes
Issue: "Module not found" errors
- Ensure the package is installed: pip install -e .
- Activate the correct conda environment
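When diagnosing the first two issues, it can help to count how many routes are actually usable before clustering. The sketch below is a hypothetical helper, not part of the CLI; it assumes passing_functions and errored_functions are lists and uses the minimum of 2 routes stated above.

```python
def clustering_readiness(routes, minimum=2):
    """Summarize how many annotated routes can contribute to clustering."""
    usable = [r for r in routes if r.get("passing_functions")]
    with_errors = [r for r in routes if r.get("errored_functions")]
    return {
        "total": len(routes),
        "usable": len(usable),
        "with_errors": len(with_errors),
        "ready": len(usable) >= minimum,
    }

# Hypothetical annotated routes: only one has any passing function.
routes = [
    {"passing_functions": ["fn_a"], "errored_functions": []},
    {"passing_functions": [], "errored_functions": ["fn_b"]},
    {"errored_functions": []},
]
report = clustering_readiness(routes)
print(report)
```

A report with "ready" set to False and a high "with_errors" count points toward a route format mismatch rather than a genuine absence of strategies.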
All commands should be run from the scripts/ directory.
This step uses an LLM to generate the initial set of raw strategy functions.
Script: 01_run_function_generation.sh
This is a bash script that calls the underlying Python generation logic. Before running, you can configure the parameters inside the script, such as the input data, number of samples, and which LLM to use.
How to Run:
cd scripts/
./01_run_function_generation.sh
Output: A JSON file (e.g., train_t_google_gemini-1.5-flash_results.json) will be created in the output directory specified within the script (default: ../data/function_lib_test).
This script parses the JSON output from Step 1 and creates individual .py files for each successfully generated function.
Script: 02_extract_functions_from_json.py
How to Run:
You need to provide the path to the JSON file generated in the previous step and specify an output directory for the .py files.
# Replace with the actual path to your generated JSON
JSON_FILE="../data/function_lib_test/train_t_google_gemini-1.5-flash_results.json"
# This directory will contain all the raw Python functions
OUTPUT_DIR="../data/generated_functions_raw"
python 02_extract_functions_from_json.py "$JSON_FILE" --output-dir "$OUTPUT_DIR"
Output: The ../data/generated_functions_raw directory will be populated with one .py file per extracted function. This directory is the main input for the next step.
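The extraction idea can be illustrated with a short sketch that keeps only syntactically valid Python containing at least one function definition. The record layout (a "code" key), the output file naming, and the validity rule are assumptions made for this example; the actual script may parse a different JSON structure and apply different checks.

```python
import ast
import tempfile
from pathlib import Path

def extract_functions(records, output_dir):
    """Write each valid code string to its own .py file and return the file names."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for i, record in enumerate(records):
        code = record.get("code", "")
        try:
            tree = ast.parse(code)
        except SyntaxError:
            continue  # skip LLM output that is not valid Python
        # Require at least one top-level function definition.
        if not any(isinstance(node, ast.FunctionDef) for node in tree.body):
            continue
        path = out / f"function_{i:04d}.py"
        path.write_text(code, encoding="utf-8")
        written.append(path.name)
    return written

# Hypothetical generation records: one valid, one broken, one without a function.
records = [
    {"code": "def check(route):\n    return True\n"},
    {"code": "def broken(:\n"},
    {"code": "x = 1\n"},
]
with tempfile.TemporaryDirectory() as tmp:
    written = extract_functions(records, tmp)
print(written)
```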
This is the main orchestration script that takes the raw functions and puts them through the four-stage refinement process.
Script: 03_run_filtering_pipeline.py
How to Run:
The script is highly configurable via command-line arguments. The most important one is --source-code-dir, which must point to the directory created in Step 2.
# Run the full pipeline on the raw functions
python 03_run_filtering_pipeline.py \
--source-code-dir ../data/generated_functions_raw
Key Parameters:
- --source-code-dir: (Required) Path to the input directory of .py function files.
- --base-output-dir: Where the intermediate and final code artifacts will be stored. Default: ./pipeline_output.
- --reports-dir: Where JSON reports for each stage will be saved. Default: ../data/filtering_data.
- --run-stages: Specify which stages to run (e.g., --run-stages 1 2 to run only the two filtering stages). Default: 1 2 3 4.
- --flash-filter-level: Which functions to keep from the first stage: perfect (default) or good_and_perfect.
Output:
- ./pipeline_output/: Contains the outputs of each stage in numbered subdirectories. The final, refactored functions are located in ./pipeline_output/4_modified_functions_output/.
- ../data/filtering_data/: Contains JSON reports detailing the results of each stage (e.g., flash_filter_report.json, modification_report.json).
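For readers who want to see how the documented flags and defaults fit together, the sketch below mirrors them with argparse. It is a reconstruction from the parameter list above, not the script's actual parser, which may define additional options or validate values differently.

```python
import argparse

def build_parser():
    """Parser mirroring the documented flags; an illustrative assumption only."""
    parser = argparse.ArgumentParser(description="Sketch of the filtering pipeline flags")
    parser.add_argument("--source-code-dir", required=True)
    parser.add_argument("--base-output-dir", default="./pipeline_output")
    parser.add_argument("--reports-dir", default="../data/filtering_data")
    parser.add_argument("--run-stages", nargs="+", type=int,
                        choices=[1, 2, 3, 4], default=[1, 2, 3, 4])
    parser.add_argument("--flash-filter-level",
                        choices=["perfect", "good_and_perfect"], default="perfect")
    return parser

# Equivalent to: --source-code-dir ../data/generated_functions_raw --run-stages 1 2
args = build_parser().parse_args(
    ["--source-code-dir", "../data/generated_functions_raw", "--run-stages", "1", "2"]
)
print(args.run_stages, args.flash_filter_level)
```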
For large-scale processing, the final function library can be used to annotate routes on a high-performance computing (HPC) cluster using SLURM.
Script: submit_annotation_jobs.sh
This script submits a SLURM job array to process thousands of route files in parallel. Each task in the array annotates one route file with the entire library of strategy functions.
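The one-file-per-task mapping can be sketched as follows. SLURM exposes the index of each array task through the SLURM_ARRAY_TASK_ID environment variable; the helper names, the sorted-order convention, and the file names here are assumptions for illustration, and the actual script may select files differently.

```python
import os

def current_task_id(default=1):
    """Read the array index SLURM assigns to this task (falls back for local runs)."""
    return int(os.environ.get("SLURM_ARRAY_TASK_ID", default))

def file_for_task(route_files, task_id):
    """Map a 1-based array task ID to exactly one route file."""
    ordered = sorted(route_files)  # stable order so every task agrees on the mapping
    if not 1 <= task_id <= len(ordered):
        raise IndexError(f"task {task_id} has no file (only {len(ordered)} files)")
    return ordered[task_id - 1]

# Hypothetical route files; task 2 of the array would process the second one.
files = ["routes_002.json", "routes_001.json", "routes_003.json"]
selected = file_for_task(files, 2)
print(selected)
```

Sorting before indexing matters because directory listings are not guaranteed to come back in the same order on every node.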
Configuration:
Before submitting, you must edit the "USER CONFIGURATION" section inside the script to set the correct paths for your project directory, input/output folders, and Conda environment.
How to Run:
sbatch submit_annotation_jobs.sh
Script: submit_clustering_job.sh
After the annotation jobs are complete, this script runs a final analysis job. It aggregates all the annotated route data and performs a clustering analysis to identify which strategies are most common and which tend to co-occur.
Configuration:
Similarly, you must edit the "USER CONFIGURATION" section to match your environment paths.
How to Run:
sbatch submit_clustering_job.sh
For users without access to a SLURM-based HPC cluster, a local execution script is provided that can run the annotation and clustering steps on a single machine.
Script: run_local.sh
This script orchestrates a two-step analysis pipeline without using SLURM:
- Step 1 - Annotation: Iterates through all raw route files and runs the annotation.py script for each file in parallel, up to a user-defined limit (MAX_PARALLEL_JOBS).
- Step 2 - Clustering: After all annotation jobs complete, runs clustering.py once to aggregate the results and produce a final analysis file.
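The same two-step pattern, bounded parallel annotation followed by a single clustering pass, can be sketched in Python. The annotate_one and cluster_all callables are stand-ins invented for this example; in the real script each annotation task launches annotation.py as a separate process.

```python
from concurrent.futures import ThreadPoolExecutor

def run_local(route_files, annotate_one, cluster_all, max_parallel_jobs=2):
    """Annotate files with bounded parallelism, then cluster once."""
    # Step 1: never more than max_parallel_jobs annotation tasks run at once.
    with ThreadPoolExecutor(max_workers=max_parallel_jobs) as pool:
        annotated = list(pool.map(annotate_one, route_files))
    # Step 2: runs a single time, only after every annotation task has finished.
    return cluster_all(annotated)

# Stand-in callables for illustration only.
def annotate_one(path):
    return {"file": path, "passing_functions": ["fn_a"]}

def cluster_all(annotated):
    return {"n_routes": len(annotated), "files": [a["file"] for a in annotated]}

summary = run_local(["r1.json", "r2.json", "r3.json"], annotate_one, cluster_all)
print(summary)
```

Leaving the with block is what guarantees that clustering starts only after all annotation tasks have completed.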
Before running, you must edit the "USER CONFIGURATION" section inside the script to set:
- PROJECT_DIR: Base path for your project
- TOTAL_FILES: Number of files to process (replaces SLURM's --array=1-5000)
- MAX_PARALLEL_JOBS: How many annotation processes to run simultaneously (recommended: half of your CPU cores)
- CPUS_PER_TASK: How many CPUs each individual annotation process can use for its internal multiprocessing
- Input/output directories:
  - RAW_ROUTES_DIR: Location of input route files
  - ANNOTATED_ROUTES_DIR: Where annotated files will be saved
  - RESULTS_DIR: Where final analysis will be saved
  - STRATEGY_FUNCTIONS_DIR: Location of strategy functions library
- CONDA_ENV_NAME: Your conda environment name
- Conda installation path (update the source path in the script)
# 1. Make the script executable
chmod +x run_local.sh
# 2. Run the script
./run_local.sh
- Annotation logs: Individual log files for each task will be saved in logs/annotation_XXXX.log
- Annotated routes: Saved in the ANNOTATED_ROUTES_DIR directory
- Final analysis: strategy_clustering_analysis.json in the RESULTS_DIR
- The MAX_PARALLEL_JOBS setting controls memory usage and CPU load. Start conservatively and adjust based on your system's performance.
- Processing time will depend on your machine's specifications and the complexity of the route files.
- Monitor system resources during execution to optimize the parallelism settings for your hardware.
A typical project layout after running the pipeline:
.
├── data/
│ ├── function_lib_test/ # Raw JSON output from Step 1
│ ├── generated_functions_raw/ # Individual .py files from Step 2
│ └── filtering_data/ # Reports from the main pipeline (Step 3)
│
├── pipeline_output/
│ ├── 1_flash_filter_output/
│ │ └── functions/ # Functions that passed Stage 1
│ ├── 2_pro_filter_output/
│ │ └── functions/ # Functions that passed Stage 2
│ ├── 3_metadata_output/
│ │ └── function_metadata_database.json
│ └── 4_modified_functions_output/ # FINAL, REFINED FUNCTION LIBRARY
│
├── scripts/
│ ├── 01_run_function_generation.sh
│ ├── 02_extract_functions_from_json.py
│ ├── 03_run_filtering_pipeline.py
│ ├── submit_annotation_jobs.sh
│ └── submit_clustering_job.sh
│
└── src/
└── synth_strategy/
└── llm/ # Core Python modules for the pipeline