# Optimization
This notebook walks through the whole process of using the trained model for optimization and producing the synthesis plans.

## Environment Setup
First of all, we need to make sure that the notebook is running in the correct environment.

To do that, follow these steps:
 1. Create the project's environment
    To do that, go to the project's root and run:
    `conda env create -f environment.yml`
    This creates a clean conda environment with the packages needed by the project.
 2. Activate the environment
    On Linux, Mac and Windows (conda 4.4 or later):
    `conda activate synnet`
    On Linux and Mac with older conda versions:
    `source activate synnet`
 3. Install the project's module
    Now that the environment is activated, we need to install the project as a module.
    Go to the project's root and run:
    `pip install -e .`
 4. Restart Jupyter from the new environment
    Now we can start Jupyter from the environment so that it has all the dependencies we need. Simply run `jupyter notebook` and open this notebook.

To test the setup, run the following cell.

In [1]:
import sys
from pathlib import Path

# Check that the correct conda env is being used.
# Path(...).name takes the last path component with the separator of the
# current OS, so the check works on Windows, Linux and Mac alike.
if Path(sys.prefix).name != "synnet":
    print("You are not using the correct conda environment, please follow the instructions above")
else:
    try:
        import synnet

        print("The environment is setup correctly")
    except ImportError:
        print("The module 'synnet' is not installed, please follow the instructions above")

The environment is setup correctly


## Pre-Processing

Now that the conda environment is correctly set up, we can start the preliminary steps needed to produce the synthesis results.

First, let's import some packages and define some constants.
Make sure the constants match your setup.

In [3]:
import random
from pathlib import Path  # used below for project_root

from downloader import *
from preprocessor import *
from optimize import optimize

project_root = Path("..")  # Path to the project's root folder
cpu_cores = 6  # Number of CPU cores to use for the computation; more cores run faster
num_samples = 50  # Number of molecules to randomly pick from the datasets

random.seed(2022)
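
Fixing the seed makes the random pick of `num_samples` molecules repeatable across runs. The following is a minimal sketch of that behaviour using only the standard library. The `pick_samples` helper and the toy SMILES list are hypothetical, and the project's own download functions may sample differently.

```python
import random


def pick_samples(pool, num_samples, seed):
    """Hypothetical helper: draw `num_samples` items from `pool` reproducibly."""
    rng = random.Random(seed)  # local generator, leaves the global state untouched
    return rng.sample(pool, num_samples)


# Toy pool of SMILES strings, for illustration only
pool = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "C1CCCCC1", "CC(C)O"]

first = pick_samples(pool, 3, seed=2022)
second = pick_samples(pool, 3, seed=2022)
print(first == second)  # same seed, same pick
```

Using a local `random.Random` instance instead of the global `random.seed` keeps the sampling independent of any other code that draws random numbers in between.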

### Downloads
First, we need to choose the trained model to use. We can use the model we trained, or the one the paper's authors used to get their results.

The latter has to be downloaded first.

In [4]:
original_checkpoints = get_original_checkpoints(project_root)

original_checkpoints is already present, no need to compute it
<class 'synnet.models.mlp.MLP'>


In [4]:
chembl_smiles = get_chembl_dataset(project_root, num_samples)

ChEMBL is already present, no need to compute it


Now, we need to retrieve the building blocks. We asked the company to provide them so that we can correctly reproduce their results.

In [5]:
bblocks_raw = get_building_blocks(project_root)

building_blocks is already present, no need to compute it


We also need to download the molecules we want to test the model on.

In [6]:
zinc_smiles = get_zinc_dataset(project_root, num_samples)

ZINC is already present, no need to compute it


### Process Building Blocks

### Filter Building Blocks
We pre-process the building blocks to identify applicable reactants for each reaction template. In other words, we filter out all building blocks that do not match any reaction template. There is no need to keep them, as they cannot act as reactants.

In a first step, we match all building blocks with each reaction template.
In a second step, we save all matched building blocks and a collection of `Reactions` with their available building blocks.

In [7]:
bblocks, rxn_collection = filter_bblocks(project_root, bblocks_raw, cpu_cores)

filtered building blocks is already present, no need to compute it
rections is already present, no need to compute it
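
The two-step filtering described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the implementation behind `filter_bblocks`: the function `filter_building_blocks`, the template names, and the substring tests are all hypothetical, and the substring tests merely stand in for real substructure matching against reaction templates.

```python
from collections import defaultdict


def filter_building_blocks(building_blocks, templates):
    """Keep building blocks that match at least one template.

    `templates` maps a template name to a predicate. A real pipeline would
    use substructure matching against reaction templates (assumption).
    """
    available = defaultdict(list)  # template name -> matching building blocks
    matched = []
    for bb in building_blocks:
        # Step 1: match the building block against every template
        hits = [name for name, matches in templates.items() if matches(bb)]
        for name in hits:
            available[name].append(bb)
        # Step 2: keep it only if it fits at least one template
        if hits:
            matched.append(bb)
    return matched, dict(available)


# Toy data: substring tests stand in for substructure matches
toy_templates = {
    "needs_acid": lambda smi: "C(=O)O" in smi,
    "needs_amine": lambda smi: "N" in smi,
}
toy_bblocks = ["CC(=O)O", "CCN", "CCCC", "NCC(=O)O"]

kept, per_template = filter_building_blocks(toy_bblocks, toy_templates)
print(kept)          # "CCCC" is dropped, it matches no template
print(per_template)  # building blocks available to each template
```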


### Pre-compute embeddings

We use the embedding space of the building blocks a lot. Hence, we pre-compute and store the embeddings of the building blocks.

In [8]:
mol_embedder = compute_embeddings(project_root, bblocks, cpu_cores)

embeddings is already present, no need to compute it
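
The idea of computing the embeddings once and reusing them can be sketched as a simple lookup table. Everything here is hypothetical: `toy_embedding` is a hash-based stand-in and not the embedding used by the project, and the nearest-neighbour lookup is only an assumed example of how stored embeddings might be queried.

```python
import hashlib


def toy_embedding(smiles, nbits=16):
    """Stand-in embedding: hash the string into a fixed-length bit vector."""
    digest = hashlib.sha256(smiles.encode("utf-8")).digest()
    value = int.from_bytes(digest, "big")
    return tuple((value >> i) & 1 for i in range(nbits))


def precompute_embeddings(building_blocks, embed=toy_embedding):
    """Compute every embedding once and keep it in a lookup table."""
    return {bb: embed(bb) for bb in building_blocks}


def nearest_building_block(query, table):
    """Return the stored building block closest to `query` (Hamming distance)."""

    def distance(bb):
        return sum(a != b for a, b in zip(query, table[bb]))

    return min(table, key=distance)


table = precompute_embeddings(["CC(=O)O", "CCN", "NCC(=O)O"])
query = table["CCN"]  # reuse a stored vector as the query
print(nearest_building_block(query, table))
```

Each later lookup then only compares against stored vectors instead of recomputing the embedding of every building block.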


## Optimization

In [None]:
optimize(
    zinc_smiles[:5],  # first 5 samples
    bblocks,
    rxn_collection,
    original_checkpoints,
    mol_embedder,
    project_root / "reproducibility" / "results" / "optimize" / "zinc",
    nbits=4096,
    num_gen=2,
    objective="gsk",
    rxn_template="hb",
    num_offspring=128,
    cpu_cores=cpu_cores
)

Start.
Data loaded


Downloading Oracle...
100%|██████████| 27.8M/27.8M [00:13<00:00, 1.99MiB/s]
Done!


Initial: 0.010 +/- 0.006
Scores: [0.02 0.01 0.01 0.01 0.  ]
Top-3 Smiles: ['O=C(c1ccccc1)c1ccc(CBr)cc1', 'O=Cc1ccc(Cl)cc1OCc1cc(Cl)c(Br)c(Cl)c1', 'Cc1cc(C2CCCN2C(=O)c2c(C)noc2C2CC2)on1']
Starting generation 0


In [None]:
optimize(
    chembl_smiles[:5],  # first 5 samples
    bblocks,
    rxn_collection,
    original_checkpoints,
    mol_embedder,
    project_root / "reproducibility" / "results" / "optimize" / "chembl",
    nbits=4096,
    num_gen=2,
    objective="gsk",
    rxn_template="hb",
    num_offspring=128,
    cpu_cores=cpu_cores
)