Enhancing diversity in language based models for single-step retrosynthesis

This repository provides code to enhance the diversity of predictions from single-step retrosynthesis models. The models were trained using the OpenNMT-py framework.

Install / Build

Create Environment and Install

# Create environment
conda create -n rxn-cluster-token-prompt python=3.7
conda activate rxn-cluster-token-prompt

# Add required initial packages (rxnfp may also require Rust to be installed)
pip install rdkit

# Clone or download the repository (not shown), `cd` into it, and install it as a Python package
cd rxn_cluster_token_prompt/
pip install -e .

For development

pip install -e .[dev]

Before committing during development, please run:

black .
isort --profile black .
flake8
mypy .

To simplify running the scripts, export the path to this repo and the path where the dataset files should be stored:

export REPO_PATH=/path/to/the/repository
export DATASET_OUTPUT=/your/output/directory
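The commands in the rest of this README rely on both variables being set. As a minimal sketch, the check below reads them from Python before anything is launched; the helper name `resolve_paths` is hypothetical and not part of the package.

```python
import os
from pathlib import Path


def resolve_paths(env=None):
    """Return (repo_path, dataset_output) read from environment variables.

    Hypothetical helper for illustration; not part of rxn_cluster_token_prompt.
    """
    env = os.environ if env is None else env
    # Collect every variable that is unset or empty so the error names all of them.
    missing = [name for name in ("REPO_PATH", "DATASET_OUTPUT") if not env.get(name)]
    if missing:
        raise KeyError(f"Please export: {', '.join(missing)}")
    return Path(env["REPO_PATH"]), Path(env["DATASET_OUTPUT"])
```

Calling `resolve_paths()` without arguments reads the current process environment; passing a dictionary is convenient for testing.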

To use the open-source models, download them into $REPO_PATH following this link and unpack them with:

tar xvf models.tgz

This will place the models in a folder called models, under $REPO_PATH.
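If `tar` is not available, the same extraction can be sketched with the Python standard library. The function name `unpack_models` is hypothetical; only the archive name `models.tgz` and the target folder come from the text above.

```python
import tarfile
from pathlib import Path


def unpack_models(archive, destination):
    """Extract a tar archive (gzip-compressed or not) into `destination`.

    Hypothetical helper; returns the sorted member names for inspection.
    Only use it on archives from a source you trust.
    """
    destination = Path(destination)
    # "r:*" lets tarfile detect the compression transparently.
    with tarfile.open(archive, "r:*") as tar:
        names = sorted(tar.getnames())
        tar.extractall(destination)
    return names
```

For example, `unpack_models("models.tgz", ".")` run from inside `$REPO_PATH` should have the same effect as the `tar` command above.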

Try it out!

You can easily try out the rxn cluster token prompt model for high diversity retrosynthesis predictions with 3 lines of code:

from rxn_cluster_token_prompt.model import RXNClusterTokenPrompt
retro_model = RXNClusterTokenPrompt(n_best=1)
retro_model.retro_predict(["CCN(CC)Cc1ccc(-c2nc(C)c(COc3ccc([C@H](CC(=O)N4C(=O)OC[C@@H]4Cc4ccccc4)c4ccon4)cc3)s2)cc1"], reorder_by_forward_likelihood=True, verbose=True)

The code above calls the default model (10clusters on USPTO). The n_best argument sets the number of predictions retained per cluster token.

The output is a list of tuples, each containing (target, predicted_precursors, retro_confidence, predicted_product, forward_confidence, predicted_class).

To make predictions on a larger dataset, we recommend the procedure outlined below (after the USPTO dataset preparation), as the approach above is not implemented for GPUs.
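Because every result is a tuple with a fixed field order, it can help to give the fields names before post-processing. The sketch below is an assumption-laden illustration: the class `Prediction` and the function `round_trip_hits` are hypothetical, the field types are guesses, and the string comparison assumes that target and predicted product are SMILES canonicalized in the same way.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    """Field order follows the tuple layout described above; types are assumed."""

    target: str
    predicted_precursors: str
    retro_confidence: float
    predicted_product: str
    forward_confidence: float
    predicted_class: str


def round_trip_hits(results):
    """Keep the predictions whose forward-predicted product equals the target."""
    records = [Prediction(*row) for row in results]
    return [record for record in records if record.predicted_product == record.target]
```

Passing the list returned by `retro_predict` to `round_trip_hits` would then keep only the predictions that the forward model maps back to the requested target.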

USPTO Datasets generation

To generate the files for training the high diversity models as well as a forward and a classification model for the analysis, first download the dataset and preprocess it:

from rxn_cluster_token_prompt.uspto_datasets_loader import USPTOLoader
loader = USPTOLoader('USPTO_50K')
loader.download_dataset()
loader.process_dataset()

Results are saved in ${REPO_PATH}/data/uspto.
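Before generating the training files, it can be useful to check how the reactions are distributed over the classes. The following sketch assumes the processed file is a comma-separated CSV with a header row that includes the `class` column used elsewhere in this README; the function name is hypothetical.

```python
import csv
from collections import Counter


def count_reactions_per_class(csv_path, class_column="class"):
    """Count the rows of the processed CSV per value of `class_column`."""
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        # Values are read as strings, so class "1" and class 1 are not distinguished.
        return Counter(row[class_column] for row in reader)
```

The returned `Counter` supports `most_common()`, which lists the largest classes first.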

Then, you can generate the files for training and inference with the command:

generate-dataset-files \
  --input_csv_file ${REPO_PATH}/data/uspto/USPTO_50K_processed.csv \
  --output_path ${DATASET_OUTPUT} \
  --rxn-column-name reactions_can \
  --cluster-column-name class \
  --model-type retro

For options on how to use the command, run generate-dataset-files --help. The --model-type can be one of retro, forward, or classification.

By specifying the --cluster-column-name you can choose how to build your cluster token prompt model. The column class in USPTO contains the reaction classes. To see how to choose a different clustering technique, check the next section.

When the flag --baseline is passed together with the retro model type, the data for the baseline retrosynthesis model is generated.
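When files for several model types are needed, the command line can be assembled programmatically. The sketch below only uses the flags shown above; the function name is hypothetical, and whether every flag is required for every model type is not stated here, so check `generate-dataset-files --help` before relying on it.

```python
MODEL_TYPES = ("retro", "forward", "classification")


def build_generate_command(input_csv, output_path, model_type,
                           rxn_column="reactions_can", cluster_column="class",
                           baseline=False):
    """Return the argument list for one generate-dataset-files call."""
    if model_type not in MODEL_TYPES:
        raise ValueError(f"model_type must be one of {MODEL_TYPES}")
    if baseline and model_type != "retro":
        # The README only describes --baseline together with the retro model type.
        raise ValueError("--baseline is only described for the retro model type")
    command = [
        "generate-dataset-files",
        "--input_csv_file", str(input_csv),
        "--output_path", str(output_path),
        "--rxn-column-name", rxn_column,
        "--cluster-column-name", cluster_column,
        "--model-type", model_type,
    ]
    if baseline:
        command.append("--baseline")
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)` once the package is installed and the command is on the PATH.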

Construction of the clusterers

To apply a clustering technique other than the reaction classification provided by Schneider et al., you can use the create-clusterer script. The example below generates the clusterer for the 10clustersKmeans model.

First, set the following environment variables (examples are given as comments):

export FPS_SAVE_PATH="${REPO_PATH}/data/uspto/USPTO_50K_processed_fingerprints.pkl" # The absolute filepath where to store the computed fingerprints
export DATA_CSV_PATH="${REPO_PATH}/data/uspto/USPTO_50K_processed.csv" # The absolute path to the data on which to compute the fingerprints
export RXN_SMILES_COLUMN="reactions_can" # The column name where the reactions are stored

Then, run the script:

create-clusterer \
  --clusterer_pkl ${REPO_PATH}/data/uspto/USPTO_50K_processed_10clustersKmeans_clusterer.pkl \
  --pca_components 3 \
  --n_clusters 10

You can tune the number of PCA components and the number of clusters to generate your clusterer.
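When trying several settings, the environment variables and the command can be prepared together. This is a sketch under assumptions: the function name is hypothetical, and the output file name simply generalizes the `10clustersKmeans` pattern from the example above.

```python
import os


def clusterer_invocation(repo_path, n_clusters=10, pca_components=3, base_env=None):
    """Return (command, env) for one create-clusterer run."""
    data_dir = f"{repo_path}/data/uspto"
    # Copy the environment so the caller's variables are left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "FPS_SAVE_PATH": f"{data_dir}/USPTO_50K_processed_fingerprints.pkl",
        "DATA_CSV_PATH": f"{data_dir}/USPTO_50K_processed.csv",
        "RXN_SMILES_COLUMN": "reactions_can",
    })
    # Assumed naming convention, generalized from the example file name.
    clusterer_pkl = f"{data_dir}/USPTO_50K_processed_{n_clusters}clustersKmeans_clusterer.pkl"
    command = [
        "create-clusterer",
        "--clusterer_pkl", clusterer_pkl,
        "--pca_components", str(pca_components),
        "--n_clusters", str(n_clusters),
    ]
    return command, env
```

Each pair can then be run with `subprocess.run(command, env=env, check=True)`, for instance in a loop over candidate values of `n_clusters`.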

Prediction of the cluster id

Once the clusterer is generated, you are ready to compute the cluster ids for the training/validation/test reactions. The cluster id is added as an additional column to the input CSV file.

cluster-csv \
  --input_csv ${REPO_PATH}/data/uspto/USPTO_50K_processed.csv \
  --output_csv ${REPO_PATH}/data/uspto/USPTO_50K_processed_10clustersKmeans.csv \
  --clusterer_pkl ${REPO_PATH}/data/uspto/USPTO_50K_processed_10clustersKmeans_clusterer.pkl

As an alternative to fingerprint clustering, you can group the reaction classes at random by passing the --n_clusters_random option, which sets the number of clusters. Run cluster-csv --help for more information.
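To illustrate what a random grouping of reaction classes means, the sketch below assigns each class to one of a fixed number of groups. It only conveys the idea: the function name, the seed handling, and the balanced round-robin assignment are assumptions, and the implementation behind --n_clusters_random may differ.

```python
import random


def random_class_grouping(classes, n_clusters, seed=42):
    """Map each distinct reaction class to one of `n_clusters` random groups."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    unique = sorted(set(classes))
    rng = random.Random(seed)  # local generator, so results are reproducible
    rng.shuffle(unique)
    # Deal the shuffled classes round-robin so group sizes differ by at most one.
    return {cls: index % n_clusters for index, cls in enumerate(unique)}
```

All reactions of a given class end up in the same group, so the grouping carries no chemical information beyond the class labels themselves.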

You can now generate the files with:

generate-dataset-files \
  --input_csv_file ${REPO_PATH}/data/uspto/USPTO_50K_processed_10clustersKmeans.csv \
  --output_path ${DATASET_OUTPUT} \
  --rxn-column-name reactions_can \
  --cluster-column-name cluster_id \
  --model-type retro

The files will be saved under ${DATASET_OUTPUT}/random5, where 5 is the random seed used to generate the splits. You can change the seed with the --seed option.
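The output location can be derived from the seed in the same way. The helpers below are hypothetical and only assume the `random<seed>` folder naming described above; they make no assumption about the names of the generated files.

```python
from pathlib import Path


def split_directory(dataset_output, seed=5):
    """Return the folder holding the files generated with the given seed."""
    return Path(dataset_output) / f"random{seed}"


def list_generated_files(dataset_output, seed=5):
    """List the names found in the split folder, sorted alphabetically."""
    folder = split_directory(dataset_output, seed)
    if not folder.is_dir():
        raise FileNotFoundError(f"No generated files found in {folder}")
    return sorted(path.name for path in folder.iterdir())
```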

Training

To train the models, you can customize the script bin/training.sh and run it on a system with one GPU. The USPTO models were trained up to 130000 steps (roughly 24 hours).

Prediction

Once your models are trained, you can run the predictions with the customizable script bin/translate.sh on a system with one GPU.

Evaluation

To evaluate your models, you can customize the script bin/compute_metrics.sh. The output is a JSON file called metrics.json, which reports the values of accuracy, round-trip accuracy, class-diversity and coverage.
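The resulting file can be inspected with a few lines of Python. This sketch assumes only that metrics.json holds a JSON object at the top level; it does not assume specific key names, and both function names are hypothetical.

```python
import json


def load_metrics(path="metrics.json"):
    """Load the metrics file written by the evaluation script."""
    with open(path) as handle:
        return json.load(handle)


def format_metrics(metrics):
    """Render one 'name: value' line per top-level entry, sorted by name."""
    return "\n".join(f"{name}: {value}" for name, value in sorted(metrics.items()))
```

Printing `format_metrics(load_metrics())` from the folder containing metrics.json gives a quick overview of a run.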

Citation

@unpublished{Toniato2022enhancing,
  title={Enhancing diversity in language based models for single-step retrosynthesis},
  author={Alessandra Toniato and Alain C. Vaucher and Philippe Schwaller and Teodoro Laino},
  journal={OpenReview Preprint},
  year={2022}
}
