Merge pull request #6 from chrulm/refactor-inference

Refactor inference
wenhao-gao · Oct 12, 2022 · 82d00db
2 parents 8a5e84a + a12744e

Showing 11 changed files with 302 additions and 234 deletions.
diff --git a/.gitignore b/.gitignore
@@ -3,6 +3,8 @@
 data/
 figures/syntrees/
 results/
+checkpoints/
+oracle/
 logs/
 tmp/
 .dev/

diff --git a/INSTRUCTIONS.md b/INSTRUCTIONS.md
@@ -2,13 +2,13 @@
 
 This document outlines the process to train SynNet from scratch step-by-step.
 
-> :warning: It is still a WIP to match the filenames of the scripts to the instructions here and to simplify the dependency on parameters/filenames.
+> :warning: It is still a WIP.
 
 You can use any set of reaction templates and building blocks, but we will illustrate the process with the *Hartenfeller-Button* reaction templates and *Enamine building blocks*.
 
 *Note*: This project depends on a lot of exact filenames.
 For example, one script will save to file, the next will read that file for further processing.
-It is not a perfect approach - we are open to feedback - and advise to revise the parameters defined in each script.
+It is not a perfect approach - we are open to feedback.
 
 Let's start.
 
@@ -20,7 +20,8 @@ Let's start.
 
     ```shell
     python scripts/00-extract-smiles-from-sdf.py \
-        --input-file="data/assets/building-blocks/enamine-us.sdf"
+        --input-file="data/assets/building-blocks/enamine-us.sdf" \
+        --output-file="data/assets/building-blocks/enamine-us-smiles.csv.gz"
     ```
 
 1. Filter building blocks.
@@ -49,8 +50,9 @@ Let's start.
 
     ```bash
     python scripts/02-compute-embeddings.py \
-        --building-blocks-file "data/pre-process/building-blocks/enamine-us-smiles.csv.gz" \
-        --output-file "data/pre-process/embeddings/hb-enamine-embeddings.npy"
+        --building-blocks-file "data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz" \
+        --output-file "data/pre-process/embeddings/hb-enamine-embeddings.npy" \
+        --featurization-fct "fp_256"
     ```
 
 3. Generate *synthetic trees*
@@ -61,10 +63,10 @@ Let's start.
     ```bash
     # Generate synthetic trees
     python scripts/03-generate-syntrees.py \
-        --building-blocks-file "data/pre-process/building-blocks/enamine-us-smiles.csv.gz" \
-        --rxn-templates-file   "data/assets/reaction-templates/hb.txt" \
-        --output-file          "data/pre-process/synthetic-trees.json.gz" \
-        --number-syntrees 600000
+        --building-blocks-file "data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz" \
+        --rxn-templates-file "data/assets/reaction-templates/hb.txt" \
+        --output-file "data/pre-process/syntrees/synthetic-trees.json.gz" \
+        --number-syntrees "600000"
     ```
 
     In a second step, we filter out some synthetic trees to make the data pharmaceutically more interesting.
@@ -73,25 +75,26 @@ Let's start.
     ```bash
     # Filter
     python scripts/04-filter-syntrees.py \
-        --input-file  "data/pre-process/synthetic-trees.json.gz" \
-        --output-file "data/pre-process/synthetic-trees-filtered.json.gz"
+        --input-file "data/pre-process/syntrees/synthetic-trees.json.gz" \
+        --output-file "data/pre-process/syntrees/synthetic-trees-filtered.json.gz" \
+        --verbose
     ```
 
     Each *synthetic tree* is serializable and so we save all trees in a compressed `.json` file.
 
-4. Split *synthetic trees* into train,valid,test-data
+5. Split *synthetic trees* into train,valid,test-data
 
     We load the `.json`-file with all *synthetic trees* and
     straightforwardly split it into three files: `{train,test,valid}.json`.
     The default split ratio is 6:2:2.
 
     ```bash
     python scripts/05-split-syntrees.py \
-        --input-file "data/pre-process/syntrees/synthetic-trees-filtered.json.gz" \
-        --output-dir "data/pre-process/syntrees/"
+            --input-file "data/pre-process/syntrees/synthetic-trees-filtered.json.gz" \
+            --output-dir "data/pre-process/syntrees/" --verbose
     ```
 
-5. Featurization
+6. Featurization
 
    We featurize each *synthetic tree*.
    That is, we break down each tree to each iteration step ("Add", "Expand", "Extend", "End") and featurize it.
@@ -100,8 +103,8 @@ Let's start.
 
     ```bash
     python scripts/06-featurize-syntrees.py \
-        --input-dir "data/pre-process/syntrees/"
-        --output-dir "data/featurized" --verbose
+        --input-dir "data/pre-process/syntrees/" \
+        --output-dir "data/featurized/" --verbose
     ```
 
     This script will load the `{train,valid,test}` data, featurize it, and save it in
@@ -111,7 +114,7 @@ Let's start.
     The encoders for the molecules must be provided in the script.
     A short text summary of the encoders will be saved as well.
 
-6. Split features
+7. Split features
 
     Up to this point, we worked with a (featurized) *synthetic tree* as a whole,
     now we split it up into "consumable" input/output data for each of the four networks.
@@ -125,12 +128,12 @@ Let's start.
     This will create 24 new files (3 splits, 4 networks, X + y).
     All new files will be saved in `<input-dir>/Xy`.
 
-7. Train the networks
+8. Train the networks
 
-    Finally, we can train each of the four networks in `src/syn_net/models/` separately:
+    Finally, we can train each of the four networks in `src/synnet/models/` separately. For example:
 
     ```bash
-    python src/syn_net/models/act.py
+    python src/synnet/models/act.py
     ```
 
 After training a new model, you can then use the trained model to make predictions and construct synthetic trees for a given list of molecules.
@@ -148,15 +151,11 @@ To visualize trees, there is a hacky script that represents *Synthetic Trees* as
 To demo it:
 
 ```bash
-python src/syn_net/visualize/visualizer.py
+python src/synnet/visualize/visualizer.py
 ```
 
 Still to be implemented: i) target molecule, ii) "end" action
 
 To render the markdown file incl. the diagram directly in VS Code, install the extension [vscode-markdown-mermaid](https://github.com/mjbvz/vscode-markdown-mermaid) and use the built-in markdown preview.
 
 *Info*: If the images of the molecules do not load, edit + save the markdown file anywhere. For example add and delete a character with the preview open. Not sure why this happens.
-
-### Mean reciprocal rank
-
-To be added.
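
Note on the split step in `INSTRUCTIONS.md`: the 6:2:2 train/valid/test split can be illustrated with a minimal sketch. The diff does not show how `scripts/05-split-syntrees.py` is implemented, so the function name `split_622`, the shuffling, and the seed are assumptions made for illustration only.

```python
import random
from typing import Dict, List, Sequence


def split_622(items: Sequence, seed: int = 0) -> Dict[str, List]:
    """Shuffle `items` and split them into train/valid/test at a 6:2:2 ratio.

    Hypothetical helper: the real script may split differently
    (e.g. without shuffling or with another rounding rule).
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # deterministic for a fixed seed
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(shuffled) * 6 // 10
    n_valid = len(shuffled) * 2 // 10
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],  # remainder goes to test
    }
```

With 600000 generated trees and no filtering, this would yield 360000 / 120000 / 120000 trees; after filtering, the absolute counts shrink but the ratio stays the same.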
diff --git a/README.md b/README.md
@@ -1,6 +1,7 @@
 # SynNet
 
-This repo contains the code and analysis scripts for our amortized approach to synthetic tree generation using neural networks. Our model can serve as both a synthesis planning tool and as a tool for synthesizable molecular design.
+This repo contains the code and analysis scripts for our amortized approach to synthetic tree generation using neural networks.
+Our model can serve as both a synthesis planning tool and as a tool for synthesizable molecular design.
 
 The method is described in detail in the publication "Amortized tree generation for bottom-up synthesis planning and synthesizable molecular design" available on the [arXiv](https://arxiv.org/abs/2110.06389) and summarized below.
 
@@ -30,25 +31,31 @@ The model consists of four modules, each containing a multi-layer perceptron (ML
 
 ![the model](./figures/network.png "model scheme")
 
-These four modules predict the probability distributions of actions to be taken within a single reaction step, and determine the nodes to be added to the synthetic tree under construction. All of these networks are conditioned on the target molecule embedding.
+These four modules predict the probability distributions of actions to be taken within a single reaction step, and determine the nodes to be added to the synthetic tree under construction.
+All of these networks are conditioned on the target molecule embedding.
 
 ### Synthesis planning
 
-This task is to infer the synthetic pathway to a given target molecule. We formulate this problem as generating a synthetic tree such that the product molecule it produces (i.e., the molecule at the root node) matches the desired target molecule.
+This task is to infer the synthetic pathway to a given target molecule.
+We formulate this problem as generating a synthetic tree such that the product molecule it produces (i.e., the molecule at the root node) matches the desired target molecule.
 
-For this task, we can take a molecular embedding for the desired product, and use it as input to our model to produce a synthetic tree. If the desired product is successfully recovered, then the final root molecule will match the desired molecule used to create the input embedding. If the desired product is not successully recovered, it is possible the final root molecule may still be *similar* to the desired molecule used to create the input embedding, and thus our tool can also be used for *synthesizable analog recommendation*.
+For this task, we can take a molecular embedding for the desired product, and use it as input to our model to produce a synthetic tree.
+If the desired product is successfully recovered, then the final root molecule will match the desired molecule used to create the input embedding.
+If the desired product is not successully recovered, it is possible the final root molecule may still be *similar* to the desired molecule used to create the input embedding, and thus our tool can also be used for *synthesizable analog recommendation*.
 
 ![the generation process](./figures/generation_process.png "generation process")
 
 ### Synthesizable molecular design
 
-This task is to optimize a molecular structure with respect to an oracle function (e.g. bioactivity), while ensuring the synthetic accessibility of the molecules. We formulate this problem as optimizing the structure of a synthetic tree with respect to the desired properties of the product molecule it produces.
+This task is to optimize a molecular structure with respect to an oracle function (e.g. bioactivity), while ensuring the synthetic accessibility of the molecules.
+We formulate this problem as optimizing the structure of a synthetic tree with respect to the desired properties of the product molecule it produces.
 
-To do this, we optimize the molecular embedding of the molecule using a genetic algorithm and the desired oracle function. The optimized molecule embedding can then be used as input to our model to produce a synthetic tree, where the final root molecule corresponds to the optimized molecule.
+To do this, we optimize the molecular embedding of the molecule using a genetic algorithm and the desired oracle function.
+The optimized molecule embedding can then be used as input to our model to produce a synthetic tree, where the final root molecule corresponds to the optimized molecule.
 
 ## Setup instructions
 
-### Setting up the environment
+### Environment
 
 Conda is used to create the environment for running SynNet.
 
@@ -57,13 +64,22 @@ Conda is used to create the environment for running SynNet.
 conda env create -f environment.yml
 ```
 
-Before running any SynNet code, activate the environment and install this package in development mode. This ensures the scripts can find the right files. You can do this by typing:
+Before running any SynNet code, activate the environment and install this package in development mode:
 
 ```bash
 source activate synnet
 pip install -e .
 ```
 
+The model implementations can be found in `src/syn_net/models/`.
+
+The pre-processing and analysis scripts are in `scripts/`.
+
+### Train the model from scratch
+
+Before training any models, you will first need to some data preprocessing.
+Please see [INSTRUCTIONS.md](INSTRUCTIONS.md) for a complete guide.
+
 ### Data
 
 SynNet relies on two datasources:
@@ -77,11 +93,6 @@ The building blocks are not freely available.
 To obtain the data, go to [https://enamine.net/building-blocks/building-blocks-catalog](https://enamine.net/building-blocks/building-blocks-catalog).
 We used the "Building Blocks, US Stock" data. You need to first register and then request access to download the dataset. The people from enamine.net manually approve you, so please be nice and patient.
 
-## Code Structure
-
-The model implementations can be found in [src/syn_net/models/](src/syn_net/models/).
-The pre-processing and analysis scripts are in [scripts/](scripts/).
-
 ## Reproducing results
 
 Before running anything, set up the environment as described above.
@@ -95,11 +106,18 @@ For further details, please see the publication.
 To download the pre-trained model to `./checkpoints`:
 
 ```bash
-mkdir -p checkpoints && cd checkpoints
 # Download
 wget -O hb_fp_2_4096_256.tar.gz https://figshare.com/ndownloader/files/31067692
 # Extract
 tar -vxf hb_fp_2_4096_256.tar.gz
+# Rename files to match new scripts (...)
+mv hb_fp_2_4096_256/ checkpoints/
+for model in "act" "rt1" "rxn" "rt2"
+do
+  mkdir checkpoints/$model
+  mv "checkpoints/$model.ckpt" "checkpoints/$model/ckpts.dummy-val_loss=0.00.ckpt"
+done
+rm -f hb_fp_2_4096_256.tar.gz
 ```
 
 The following scripts are run from the command line.
@@ -109,51 +127,40 @@ Use `python some_script.py --help` or check the source code to see the instructi
 
 In addition to the necessary data, we will need to pre-compute an embedding of the building blocks.
 To do so, please follow steps 0-2 from the [INSTRUCTIONS.md](INSTRUCTIONS.md).
+Then, replace the environment variables in the commands below.
 
 #### Synthesis Planning
 
 To perform synthesis planning described in the main text:
 
 ```bash
-python scripts/predict_multireactant_mp.py \
-    -n -1 \
+python scripts/20-predict-targets.py \
+    --building-blocks-file $BUILDING_BLOCKS_FILE \
+    --rxns-collection-file $RXN_COLLECTION_FILE \
+    --embeddings-knn-file $EMBEDDINGS_KNN_FILE \
     --data "data/assets/molecules/sample-targets.txt" \
-    --ncpu 10
+    --ckpt-dir "checkpoints/" \
+    --output-dir "results/demo-inference/"
 ```
 
-This script will feed a list of ten randomly selected molecules from the validation to SynNet.
-The decoded results, i.e. the predicted synthesis trees, are saved to `DATA_RESULT_DIR`.
-(Paths are defined in [src/syn_net/config.py](src/syn_net/config.py).)
-
-*Note*: To do synthesis planning, you will need a list of target molecules (provided), building blocks (need to download) and embeddings (need to compute).
+This script will feed a list of ten molecules to SynNet.
 
 #### Synthesizable Molecular Design
 
 To perform synthesizable molecular design, run:
 
 ```bash
 python scripts/optimize_ga.py \
-    -i path/to/zinc.csv \
+    --ckpt-dir "checkpoints/" \
+      --building-blocks-file $BUILDING_BLOCKS_FILE \
+    --rxns-collection-file $RXN_COLLECTION_FILE \
+    --embeddings-knn-file $EMBEDDINGS_KNN_FILE \
+    --input-file path/to/zinc.csv \
     --radius 2 --nbits 4096 \
     --num_population 128 --num_offspring 512 --num_gen 200 --objective gsk \
     --ncpu 32
 ```
 
 This script uses a genetic algorithm to optimize molecular embeddings and returns the predicted synthetic trees for the optimized molecular embedding.
 
-If user wants to start from a checkpoint of previous run, run:
-
-```bash
-python scripts/optimize_ga.py \
-    -i path/to/population.npy \
-    --radius 2 --nbits 4096 \
-    --num_population 128 --num_offspring 512 --num_gen 200 --objective gsk --restart \
-    --ncpu 32
-```
-
-Note: the input file indicated by `-i` contains the seed molecules in CSV format for an initial run, and as a pre-saved numpy array of the population for restarting the run.
-
-### Train the model from scratch
-
-Before training any models, you will first need to some data preprocessing.
-Please see [INSTRUCTIONS.md](INSTRUCTIONS.md) for a complete guide.
+Note: `input-file` contains the seed molecules in CSV format for an initial run, and as a pre-saved numpy array of the population for restarting the run. If omitted, a random fingerprint will be chosen.
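
Note on the checkpoint renaming in `README.md`: the dummy `val_loss=0.00` in the new filenames matters because the inference script selects, for each network, the checkpoint whose filename carries the lowest `val_loss`. A minimal sketch of that lookup, modelled on the `find_best_model_ckpt` helper that this pull request moves into `synnet.models.common`. The name `find_best_ckpt` and the use of `float("inf")` as the initial value are choices made here for illustration, not the repository's code.

```python
from pathlib import Path
from typing import Optional


def find_best_ckpt(ckpt_dir: Path) -> Optional[Path]:
    """Return the `*.ckpt` file under `ckpt_dir` with the lowest val_loss.

    Expects filenames such as `ckpts.epoch=70-val_loss=0.03.ckpt`;
    returns None if no checkpoint is found.
    """
    best: Optional[Path] = None
    lowest = float("inf")
    for file in Path(ckpt_dir).rglob("*.ckpt"):
        # `stem` drops only the final ".ckpt", so "...val_loss=0.03" remains
        # and the text after "val_loss=" parses as a float.
        val_loss = float(file.stem.split("val_loss=")[-1])
        if val_loss < lowest:
            best, lowest = file, val_loss
    return best
```

A file named `ckpts.dummy-val_loss=0.00.ckpt` parses to `0.0`, so a pre-trained checkpoint renamed this way is picked up without retraining. A `.ckpt` file that lacks the `val_loss=` part would raise a `ValueError` in this sketch.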
diff --git a/scripts/20-predict-targets.py b/scripts/20-predict-targets.py
@@ -5,15 +5,15 @@
 import logging
 import multiprocessing as mp
 from pathlib import Path
-from typing import Tuple, Union
+from typing import Tuple
 
 import numpy as np
 import pandas as pd
 
 from synnet.config import DATA_PREPROCESS_DIR, DATA_RESULT_DIR, MAX_PROCESSES
 from synnet.data_generation.preprocessing import BuildingBlockFileHandler
 from synnet.encoding.distances import cosine_distance
-from synnet.models.mlp import load_mlp_from_ckpt
+from synnet.models.common import find_best_model_ckpt, load_mlp_from_ckpt
 from synnet.MolEmbedder import MolEmbedder
 from synnet.utils.data_utils import ReactionSet, SyntheticTree, SyntheticTreeSet
 from synnet.utils.predict_utils import mol_fp, synthetic_tree_decoder_greedy_search
@@ -49,38 +49,6 @@ def _fetch_data(name: str) -> list[str]:
     return smiles
 
 
-def find_best_model_ckpt(path: str) -> Union[Path, None]:  # TODO: move to utils.py
-    """Find checkpoint with lowest val_loss.
-
-    Poor man's regex:
-    somepath/act/ckpts.epoch=70-val_loss=0.03.ckpt
-                                         ^^^^--extract this as float
-    """
-    ckpts = Path(path).rglob("*.ckpt")
-    best_model_ckpt = None
-    lowest_loss = 10_000 # ~ math.inf
-    for file in ckpts:
-        stem = file.stem
-        val_loss = float(stem.split("val_loss=")[-1])
-        if val_loss < lowest_loss:
-            best_model_ckpt = file
-            lowest_loss = val_loss
-    return best_model_ckpt
-
-
-def _load_pretrained_model(path_to_checkpoints: list[Path]):
-    """Wrapper to load modules from checkpoint."""
-    # Define paths to pretrained models.
-    act_path, rt1_path, rxn_path, rt2_path = path_to_checkpoints
-
-    # Load the pre-trained models.
-    act_net = load_mlp_from_ckpt(act_path)
-    rt1_net = load_mlp_from_ckpt(rt1_path)
-    rxn_net = load_mlp_from_ckpt(rxn_path)
-    rt2_net = load_mlp_from_ckpt(rt2_path)
-    return act_net, rt1_net, rxn_net, rt2_net
-
-
 def wrapper_decoder(smiles: str) -> Tuple[str, float, SyntheticTree]:
     """Generate a synthetic tree for the input molecular embedding."""
     emb = mol_fp(smiles)
@@ -188,8 +156,8 @@ def get_args():
     # ... models
     logger.info("Start loading models from checkpoints...")
     path = Path(args.ckpt_dir)
-    paths = [find_best_model_ckpt(path / model) for model in "act rt1 rxn rt2".split()]
-    act_net, rt1_net, rxn_net, rt2_net = _load_pretrained_model(paths)
+    ckpt_files = [find_best_model_ckpt(path / model) for model in "act rt1 rxn rt2".split()]
+    act_net, rt1_net, rxn_net, rt2_net = [load_mlp_from_ckpt(file) for file in ckpt_files]
     logger.info("...loading models completed.")
 
     # Decode queries, i.e. the target molecules.
@@ -205,9 +173,9 @@ def get_args():
     # Print some results from the prediction
     # Note: If a syntree cannot be decoded within `max_depth` steps (15),
     #       we will count it as unsuccessful. The similarity will be 0.
-    decoded = [smi for smi, _, _ in results ]
-    similarities = [sim for _, sim, _ in results ]
-    trees = [tree for _, _, tree in results ]
+    decoded = [smi for smi, _, _ in results]
+    similarities = [sim for _, sim, _ in results]
+    trees = [tree for _, _, tree in results]
 
     recovery_rate = (np.asfarray(similarities) == 1.0).sum() / len(similarities)
     avg_similarity = np.mean(similarities)