Merge pull request #6 from chrulm/refactor-inference
Refactor inference
chrulm committed Oct 12, 2022
2 parents 8a5e84a + a12744e commit 82d00db
Showing 11 changed files with 302 additions and 234 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -3,6 +3,8 @@
data/
figures/syntrees/
results/
checkpoints/
oracle/
logs/
tmp/
.dev/
51 changes: 25 additions & 26 deletions INSTRUCTIONS.md
@@ -2,13 +2,13 @@

This document outlines the process to train SynNet from scratch step-by-step.

> :warning: It is still a WIP to match the filenames of the scripts to the instructions here and to simplify the dependency on parameters/filenames.
> :warning: It is still a WIP.
You can use any set of reaction templates and building blocks, but we will illustrate the process with the *Hartenfeller-Button* reaction templates and *Enamine building blocks*.

*Note*: This project depends on a lot of exact filenames.
For example, one script will save to file, the next will read that file for further processing.
It is not a perfect approach - we are open to feedback - and advise to revise the parameters defined in each script.
It is not a perfect approach - we are open to feedback.

Let's start.

@@ -20,7 +20,8 @@ Let's start.

```shell
python scripts/00-extract-smiles-from-sdf.py \
--input-file="data/assets/building-blocks/enamine-us.sdf"
--input-file="data/assets/building-blocks/enamine-us.sdf" \
--output-file="data/assets/building-blocks/enamine-us-smiles.csv.gz"
```

1. Filter building blocks.
@@ -49,8 +50,9 @@ Let's start.

```bash
python scripts/02-compute-embeddings.py \
--building-blocks-file "data/pre-process/building-blocks/enamine-us-smiles.csv.gz" \
--output-file "data/pre-process/embeddings/hb-enamine-embeddings.npy"
--building-blocks-file "data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz" \
--output-file "data/pre-process/embeddings/hb-enamine-embeddings.npy" \
--featurization-fct "fp_256"
```

3. Generate *synthetic trees*
@@ -61,10 +63,10 @@ Let's start.
```bash
# Generate synthetic trees
python scripts/03-generate-syntrees.py \
--building-blocks-file "data/pre-process/building-blocks/enamine-us-smiles.csv.gz" \
--rxn-templates-file "data/assets/reaction-templates/hb.txt" \
--output-file "data/pre-process/synthetic-trees.json.gz" \
--number-syntrees 600000
--building-blocks-file "data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz" \
--rxn-templates-file "data/assets/reaction-templates/hb.txt" \
--output-file "data/pre-process/syntrees/synthetic-trees.json.gz" \
--number-syntrees "600000"
```

In a second step, we filter out some synthetic trees to make the data pharmaceutically more interesting.
@@ -73,25 +75,26 @@ Let's start.
```bash
# Filter
python scripts/04-filter-syntrees.py \
--input-file "data/pre-process/synthetic-trees.json.gz" \
--output-file "data/pre-process/synthetic-trees-filtered.json.gz"
--input-file "data/pre-process/syntrees/synthetic-trees.json.gz" \
--output-file "data/pre-process/syntrees/synthetic-trees-filtered.json.gz" \
--verbose
```

Each *synthetic tree* is serializable and so we save all trees in a compressed `.json` file.

4. Split *synthetic trees* into train,valid,test-data
5. Split *synthetic trees* into train,valid,test-data

We load the `.json`-file with all *synthetic trees* and
split it into three files: `{train,test,valid}.json`.
The default split ratio is 6:2:2.

```bash
python scripts/05-split-syntrees.py \
--input-file "data/pre-process/syntrees/synthetic-trees-filtered.json.gz" \
--output-dir "data/pre-process/syntrees/"
--input-file "data/pre-process/syntrees/synthetic-trees-filtered.json.gz" \
--output-dir "data/pre-process/syntrees/" --verbose
```
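The 6:2:2 split described above can be sketched in a few lines. This is only an illustration, not the code in `scripts/05-split-syntrees.py`: the function name `split_6_2_2`, the shuffling, and the fixed seed are assumptions made here, and the real script may split the trees differently.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_6_2_2(items: Sequence[T], seed: int = 42) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle `items` and split them into train/valid/test at a 6:2:2 ratio.

    Hypothetical helper: shuffling and the seed are assumptions of this sketch.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible

    n = len(shuffled)
    n_train = n * 6 // 10  # integer arithmetic avoids float rounding surprises
    n_valid = n * 2 // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train : n_train + n_valid]
    test = shuffled[n_train + n_valid :]  # the remainder absorbs any rounding
    return train, valid, test


if __name__ == "__main__":
    train, valid, test = split_6_2_2(range(10))
    print(len(train), len(valid), len(test))
```

Assigning the remainder to the test split keeps every item in exactly one split even when the total is not divisible by ten.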

5. Featurization
6. Featurization

We featurize each *synthetic tree*.
That is, we break down each tree to each iteration step ("Add", "Expand", "Extend", "End") and featurize it.
@@ -100,8 +103,8 @@ Let's start.

```bash
python scripts/06-featurize-syntrees.py \
--input-dir "data/pre-process/syntrees/"
--output-dir "data/featurized" --verbose
--input-dir "data/pre-process/syntrees/" \
--output-dir "data/featurized/" --verbose
```

This script will load the `{train,valid,test}` data, featurize it, and save it in
@@ -111,7 +114,7 @@ Let's start.
The encoders for the molecules must be provided in the script.
A short text summary of the encoders will be saved as well.

6. Split features
7. Split features

Up to this point, we worked with a (featurized) *synthetic tree* as a whole,
now we split it up into "consumable" input/output data for each of the four networks.
@@ -125,12 +128,12 @@ Let's start.
This will create 24 new files (3 splits, 4 networks, X + y).
All new files will be saved in `<input-dir>/Xy`.
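To make the count of 24 concrete, the combinations can be enumerated as below. The network names `act`, `rt1`, `rxn`, `rt2` and the `Xy` subfolder come from this document; the filename pattern `{part}_{network}_{split}.npz` and the helper `expected_xy_files` are purely hypothetical and may not match what the script writes.

```python
from itertools import product
from pathlib import Path
from typing import List

SPLITS = ("train", "valid", "test")
NETWORKS = ("act", "rt1", "rxn", "rt2")
PARTS = ("X", "y")


def expected_xy_files(input_dir: str) -> List[Path]:
    """List one path per (split, network, part) combination under `<input_dir>/Xy`.

    The filename pattern is an assumption of this sketch, not the script's actual naming.
    """
    out_dir = Path(input_dir) / "Xy"
    return [
        out_dir / f"{part}_{network}_{split}.npz"
        for split, network, part in product(SPLITS, NETWORKS, PARTS)
    ]


if __name__ == "__main__":
    files = expected_xy_files("data/featurized/")
    print(len(files))  # 3 splits * 4 networks * 2 parts
```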

7. Train the networks
8. Train the networks

Finally, we can train each of the four networks in `src/syn_net/models/` separately:
Finally, we can train each of the four networks in `src/synnet/models/` separately. For example:

```bash
python src/syn_net/models/act.py
python src/synnet/models/act.py
```

After training a new model, you can then use the trained model to make predictions and construct synthetic trees for a given set of molecules.
@@ -148,15 +151,11 @@ To visualize trees, there is a hacky script that represents *Synthetic Trees* as
To demo it:

```bash
python src/syn_net/visualize/visualizer.py
python src/synnet/visualize/visualizer.py
```

Still to be implemented: i) target molecule, ii) "end" action

To render the markdown file incl. the diagram directly in VS Code, install the extension [vscode-markdown-mermaid](https://github.com/mjbvz/vscode-markdown-mermaid) and use the built-in markdown preview.

*Info*: If the images of the molecules do not load, edit + save the markdown file anywhere. For example add and delete a character with the preview open. Not sure why this happens.

### Mean reciprocal rank

To be added.
85 changes: 46 additions & 39 deletions README.md
@@ -1,6 +1,7 @@
# SynNet

This repo contains the code and analysis scripts for our amortized approach to synthetic tree generation using neural networks. Our model can serve as both a synthesis planning tool and as a tool for synthesizable molecular design.
This repo contains the code and analysis scripts for our amortized approach to synthetic tree generation using neural networks.
Our model can serve as both a synthesis planning tool and as a tool for synthesizable molecular design.

The method is described in detail in the publication "Amortized tree generation for bottom-up synthesis planning and synthesizable molecular design" available on the [arXiv](https://arxiv.org/abs/2110.06389) and summarized below.

@@ -30,25 +31,31 @@ The model consists of four modules, each containing a multi-layer perceptron (ML

![the model](./figures/network.png "model scheme")

These four modules predict the probability distributions of actions to be taken within a single reaction step, and determine the nodes to be added to the synthetic tree under construction. All of these networks are conditioned on the target molecule embedding.
These four modules predict the probability distributions of actions to be taken within a single reaction step, and determine the nodes to be added to the synthetic tree under construction.
All of these networks are conditioned on the target molecule embedding.

### Synthesis planning

This task is to infer the synthetic pathway to a given target molecule. We formulate this problem as generating a synthetic tree such that the product molecule it produces (i.e., the molecule at the root node) matches the desired target molecule.
This task is to infer the synthetic pathway to a given target molecule.
We formulate this problem as generating a synthetic tree such that the product molecule it produces (i.e., the molecule at the root node) matches the desired target molecule.

For this task, we can take a molecular embedding for the desired product, and use it as input to our model to produce a synthetic tree. If the desired product is successfully recovered, then the final root molecule will match the desired molecule used to create the input embedding. If the desired product is not successfully recovered, it is possible the final root molecule may still be *similar* to the desired molecule used to create the input embedding, and thus our tool can also be used for *synthesizable analog recommendation*.
For this task, we can take a molecular embedding for the desired product, and use it as input to our model to produce a synthetic tree.
If the desired product is successfully recovered, then the final root molecule will match the desired molecule used to create the input embedding.
If the desired product is not successfully recovered, it is possible the final root molecule may still be *similar* to the desired molecule used to create the input embedding, and thus our tool can also be used for *synthesizable analog recommendation*.

![the generation process](./figures/generation_process.png "generation process")

### Synthesizable molecular design

This task is to optimize a molecular structure with respect to an oracle function (e.g. bioactivity), while ensuring the synthetic accessibility of the molecules. We formulate this problem as optimizing the structure of a synthetic tree with respect to the desired properties of the product molecule it produces.
This task is to optimize a molecular structure with respect to an oracle function (e.g. bioactivity), while ensuring the synthetic accessibility of the molecules.
We formulate this problem as optimizing the structure of a synthetic tree with respect to the desired properties of the product molecule it produces.

To do this, we optimize the molecular embedding of the molecule using a genetic algorithm and the desired oracle function. The optimized molecule embedding can then be used as input to our model to produce a synthetic tree, where the final root molecule corresponds to the optimized molecule.
To do this, we optimize the molecular embedding of the molecule using a genetic algorithm and the desired oracle function.
The optimized molecule embedding can then be used as input to our model to produce a synthetic tree, where the final root molecule corresponds to the optimized molecule.

## Setup instructions

### Setting up the environment
### Environment

Conda is used to create the environment for running SynNet.

@@ -57,13 +64,22 @@ Conda is used to create the environment for running SynNet.
conda env create -f environment.yml
```

Before running any SynNet code, activate the environment and install this package in development mode. This ensures the scripts can find the right files. You can do this by typing:
Before running any SynNet code, activate the environment and install this package in development mode:

```bash
source activate synnet
pip install -e .
```

The model implementations can be found in `src/syn_net/models/`.

The pre-processing and analysis scripts are in `scripts/`.

### Train the model from scratch

Before training any models, you will first need to do some data preprocessing.
Please see [INSTRUCTIONS.md](INSTRUCTIONS.md) for a complete guide.

### Data

SynNet relies on two data sources:
@@ -77,11 +93,6 @@ The building blocks are not freely available.
To obtain the data, go to [https://enamine.net/building-blocks/building-blocks-catalog](https://enamine.net/building-blocks/building-blocks-catalog).
We used the "Building Blocks, US Stock" data. You need to first register and then request access to download the dataset. The Enamine staff approve each request manually, so please be nice and patient.

## Code Structure

The model implementations can be found in [src/syn_net/models/](src/syn_net/models/).
The pre-processing and analysis scripts are in [scripts/](scripts/).

## Reproducing results

Before running anything, set up the environment as described above.
@@ -95,11 +106,18 @@ For further details, please see the publication.
To download the pre-trained model to `./checkpoints`:

```bash
mkdir -p checkpoints && cd checkpoints
# Download
wget -O hb_fp_2_4096_256.tar.gz https://figshare.com/ndownloader/files/31067692
# Extract
tar -vxf hb_fp_2_4096_256.tar.gz
# Rename files to match new scripts (...)
mv hb_fp_2_4096_256/ checkpoints/
for model in "act" "rt1" "rxn" "rt2"
do
mkdir checkpoints/$model
mv "checkpoints/$model.ckpt" "checkpoints/$model/ckpts.dummy-val_loss=0.00.ckpt"
done
rm -f hb_fp_2_4096_256.tar.gz
```

The following scripts are run from the command line.
@@ -109,51 +127,40 @@ Use `python some_script.py --help` or check the source code to see the instructi

In addition to the necessary data, we will need to pre-compute an embedding of the building blocks.
To do so, please follow steps 0-2 from the [INSTRUCTIONS.md](INSTRUCTIONS.md).
Then, replace the environment variables in the commands below.

#### Synthesis Planning

To perform synthesis planning described in the main text:

```bash
python scripts/predict_multireactant_mp.py \
-n -1 \
python scripts/20-predict-targets.py \
--building-blocks-file $BUILDING_BLOCKS_FILE \
--rxns-collection-file $RXN_COLLECTION_FILE \
--embeddings-knn-file $EMBEDDINGS_KNN_FILE \
--data "data/assets/molecules/sample-targets.txt" \
--ncpu 10
--ckpt-dir "checkpoints/" \
--output-dir "results/demo-inference/"
```

This script will feed a list of ten randomly selected molecules from the validation to SynNet.
The decoded results, i.e. the predicted synthesis trees, are saved to `DATA_RESULT_DIR`.
(Paths are defined in [src/syn_net/config.py](src/syn_net/config.py).)

*Note*: To do synthesis planning, you will need a list of target molecules (provided), building blocks (need to download) and embeddings (need to compute).
This script will feed a list of ten molecules to SynNet.

#### Synthesizable Molecular Design

To perform synthesizable molecular design, run:

```bash
python scripts/optimize_ga.py \
-i path/to/zinc.csv \
--ckpt-dir "checkpoints/" \
--building-blocks-file $BUILDING_BLOCKS_FILE \
--rxns-collection-file $RXN_COLLECTION_FILE \
--embeddings-knn-file $EMBEDDINGS_KNN_FILE \
--input-file path/to/zinc.csv \
--radius 2 --nbits 4096 \
--num_population 128 --num_offspring 512 --num_gen 200 --objective gsk \
--ncpu 32
```

This script uses a genetic algorithm to optimize molecular embeddings and returns the predicted synthetic trees for the optimized molecular embedding.

If you want to restart from a checkpoint of a previous run, run:

```bash
python scripts/optimize_ga.py \
-i path/to/population.npy \
--radius 2 --nbits 4096 \
--num_population 128 --num_offspring 512 --num_gen 200 --objective gsk --restart \
--ncpu 32
```

Note: the input file indicated by `-i` contains the seed molecules in CSV format for an initial run, and as a pre-saved numpy array of the population for restarting the run.

### Train the model from scratch

Before training any models, you will first need to do some data preprocessing.
Please see [INSTRUCTIONS.md](INSTRUCTIONS.md) for a complete guide.
Note: `input-file` contains the seed molecules in CSV format for an initial run, and as a pre-saved numpy array of the population for restarting the run. If omitted, a random fingerprint will be chosen.
46 changes: 7 additions & 39 deletions scripts/20-predict-targets.py
@@ -5,15 +5,15 @@
import logging
import multiprocessing as mp
from pathlib import Path
from typing import Tuple, Union
from typing import Tuple

import numpy as np
import pandas as pd

from synnet.config import DATA_PREPROCESS_DIR, DATA_RESULT_DIR, MAX_PROCESSES
from synnet.data_generation.preprocessing import BuildingBlockFileHandler
from synnet.encoding.distances import cosine_distance
from synnet.models.mlp import load_mlp_from_ckpt
from synnet.models.common import find_best_model_ckpt, load_mlp_from_ckpt
from synnet.MolEmbedder import MolEmbedder
from synnet.utils.data_utils import ReactionSet, SyntheticTree, SyntheticTreeSet
from synnet.utils.predict_utils import mol_fp, synthetic_tree_decoder_greedy_search
@@ -49,38 +49,6 @@ def _fetch_data(name: str) -> list[str]:
return smiles


def find_best_model_ckpt(path: str) -> Union[Path, None]: # TODO: move to utils.py
"""Find checkpoint with lowest val_loss.
Poor man's regex:
somepath/act/ckpts.epoch=70-val_loss=0.03.ckpt
^^^^--extract this as float
"""
ckpts = Path(path).rglob("*.ckpt")
best_model_ckpt = None
lowest_loss = 10_000 # ~ math.inf
for file in ckpts:
stem = file.stem
val_loss = float(stem.split("val_loss=")[-1])
if val_loss < lowest_loss:
best_model_ckpt = file
lowest_loss = val_loss
return best_model_ckpt


def _load_pretrained_model(path_to_checkpoints: list[Path]):
"""Wrapper to load modules from checkpoint."""
# Define paths to pretrained models.
act_path, rt1_path, rxn_path, rt2_path = path_to_checkpoints

# Load the pre-trained models.
act_net = load_mlp_from_ckpt(act_path)
rt1_net = load_mlp_from_ckpt(rt1_path)
rxn_net = load_mlp_from_ckpt(rxn_path)
rt2_net = load_mlp_from_ckpt(rt2_path)
return act_net, rt1_net, rxn_net, rt2_net


def wrapper_decoder(smiles: str) -> Tuple[str, float, SyntheticTree]:
"""Generate a synthetic tree for the input molecular embedding."""
emb = mol_fp(smiles)
@@ -188,8 +156,8 @@ def get_args():
# ... models
logger.info("Start loading models from checkpoints...")
path = Path(args.ckpt_dir)
paths = [find_best_model_ckpt(path / model) for model in "act rt1 rxn rt2".split()]
act_net, rt1_net, rxn_net, rt2_net = _load_pretrained_model(paths)
ckpt_files = [find_best_model_ckpt(path / model) for model in "act rt1 rxn rt2".split()]
act_net, rt1_net, rxn_net, rt2_net = [load_mlp_from_ckpt(file) for file in ckpt_files]
logger.info("...loading models completed.")

# Decode queries, i.e. the target molecules.
@@ -205,9 +173,9 @@ def get_args():
# Print some results from the prediction
# Note: If a syntree cannot be decoded within `max_depth` steps (15),
# we will count it as unsuccessful. The similarity will be 0.
decoded = [smi for smi, _, _ in results ]
similarities = [sim for _, sim, _ in results ]
trees = [tree for _, _, tree in results ]
decoded = [smi for smi, _, _ in results]
similarities = [sim for _, sim, _ in results]
trees = [tree for _, _, tree in results]

recovery_rate = (np.asfarray(similarities) == 1.0).sum() / len(similarities)
avg_similarity = np.mean(similarities)
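
The `find_best_model_ckpt` helper removed from this script (and now imported from `synnet.models.common`) picks the checkpoint with the lowest `val_loss` by splitting the filename. A possible variant is sketched below, assuming the same `...val_loss=0.03.ckpt` naming. It is not the version in `synnet.models.common`: the name `find_lowest_val_loss_ckpt`, the regular expression, and the choice to skip files without a `val_loss=` tag are all assumptions of this sketch.

```python
import math
import re
from pathlib import Path
from typing import Optional, Union

# Matches the float after "val_loss=", e.g. "ckpts.epoch=70-val_loss=0.03".
_VAL_LOSS = re.compile(r"val_loss=(\d+(?:\.\d+)?)")


def find_lowest_val_loss_ckpt(ckpt_dir: Union[str, Path]) -> Optional[Path]:
    """Return the `*.ckpt` file under `ckpt_dir` with the lowest val_loss, or None.

    Hypothetical helper: files whose name carries no `val_loss=` tag are skipped.
    """
    best_ckpt: Optional[Path] = None
    lowest_loss = math.inf
    for file in sorted(Path(ckpt_dir).rglob("*.ckpt")):  # sorted for determinism
        match = _VAL_LOSS.search(file.stem)
        if match is None:
            continue  # not a checkpoint named by validation loss
        val_loss = float(match.group(1))
        if val_loss < lowest_loss:
            best_ckpt, lowest_loss = file, val_loss
    return best_ckpt
```

Skipping untagged files avoids the `ValueError` that a plain `float(stem.split("val_loss=")[-1])` raises when a stray `.ckpt` file without the tag sits in the directory.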