## MolPAL: Accelerating High-Throughput Virtual Screening Through Molecular Pool-Based Active Learning

ABSTRACT: Structure-based virtual screening is an important tool in early-stage drug discovery that scores the interactions between a target protein and candidate ligands. As virtual libraries continue to grow (in excess of 10^8 molecules), so too do the resources necessary to conduct exhaustive virtual screening campaigns on these libraries. However, Bayesian optimization techniques can aid in their exploration: a surrogate structure-property relationship model trained on the predicted affinities of a subset of the library can be applied to the remaining library members, allowing the least promising compounds to be excluded from evaluation. In this study, we assess various surrogate model architectures, acquisition functions, and acquisition batch sizes as applied to several protein-ligand docking datasets and observe significant reductions in computational costs, even when using a greedy acquisition strategy; for example, 87.9% of the top-50000 ligands can be found after testing only 2.4% of a 100M member library. Such model-guided searches mitigate the increasing computational costs of screening increasingly large virtual libraries and can accelerate high-throughput virtual screening campaigns with applications beyond docking.

Link to paper: https://arxiv.org/pdf/2012.07127v1.pdf

Credit: https://github.com/coleygroup/molpal

Google Colab: https://colab.research.google.com/drive/1RtkSjOSc2mJoXAoOGuMYpXcg7Ls95btQ?usp=sharing

In [1]:
# Clone the repository and cd into directory
!git clone https://github.com/coleygroup/molpal.git
%cd molpal

/content/molpal


In [None]:
# Install requirements / dependencies
!pip install ray pytorch-lightning tensorflow tensorflow-addons typing-extensions git+https://github.com/reymond-group/map4@v1.0 optuna

# Install RDKit 
!pip install rdkit-pypi==2021.3.1.5

### Setting up a ray cluster
MolPAL parallelizes objective function calculation and model inference (training coming later) using the [`ray`](ray.io) library. MolPAL will automatically start a ray cluster if none exists, but this is highly limiting: it cannot leverage distributed resources, and it will not accurately reflect allocated resources (i.e., it will think you have access to all N cores on a cluster node, regardless of your allocation).

To properly leverage multi-node allocations, you must set up a ray cluster manually before running MolPAL. The [documentation](https://docs.ray.io/en/master/cluster/index.html) has several examples of how to set up a ray cluster, and the only thing specific to MolPAL is the reliance on two environment variables: `redis_password` and `ip_head`. MolPAL will use the values of these environment variables to connect to the proper ray cluster. An example of this may be seen in the SLURM submission script [`run_molpal.batch`](run_molpal.batch).
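
As a rough illustration of how the two environment variables could be consumed, the sketch below builds keyword arguments for `ray.init` from `ip_head` and `redis_password`. The helper name `ray_init_kwargs` is hypothetical and this is not MolPAL's actual connection code; it also assumes that `ray.init` accepts the password through `_redis_password`, as in ray 1.x releases.

```python
import os


def ray_init_kwargs(env=None):
    """Build keyword arguments for ray.init from the cluster environment.

    Hypothetical helper for illustration; MolPAL's own logic may differ.
    """
    env = os.environ if env is None else env
    address = env.get("ip_head")
    if not address:
        # No head node advertised: let ray start a local, single-node cluster.
        return {}
    kwargs = {"address": address}
    password = env.get("redis_password")
    if password:
        # Assumption: ray 1.x accepts the password via `_redis_password`.
        kwargs["_redis_password"] = password
    return kwargs


if __name__ == "__main__":
    print(ray_init_kwargs({"ip_head": "10.0.0.1:6379", "redis_password": "secret"}))
    # With ray installed, one would then call: ray.init(**ray_init_kwargs())
```

Returning an empty dictionary when `ip_head` is unset mirrors the fallback described above, in which a local cluster is started automatically.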

### Object Model
MolPAL is software for batched Bayesian optimization in a virtual screening environment. At the core of this software is the `molpal` library, which implements several classes that handle specific elements of the optimization routine.

__Explorer__: An [`Explorer`](molpal/explorer.py) is the abstraction of the optimization routine. It ties together the `MoleculePool`, `Acquirer`, `Encoder`, `Model`, and `Objective`, which each handle (roughly) a single step of a Bayesian optimization loop, into a full optimization procedure. Its main functionality is defined by the `run()` method, which performs the optimization until a stopping condition is met. It also defines convenience functions for running a single iteration of the optimization loop and for interrogating its current state, which is useful if you want to run the optimization interactively.

__MoleculePool__: A [`MoleculePool`](molpal/pools/base.py) defines the virtual library (i.e., domain of inputs) and caches precomputed feature representations, if feasible.

__Acquirer__: An [`Acquirer`](molpal/acquirer/acquirer.py) handles acquisition of unlabeled inputs from the MoleculePool according to its `metric` and the prior distribution over the data. The [`metric`](molpal/acquirer/metrics.py) is a function that takes an input array of predictions and returns an array of equal dimension containing acquisition utilities.

__Featurizer__: A [`Featurizer`](molpal/encoder.py) computes the uncompressed feature representation of an input based on its identifier for use with clustering and models that expect vectors as inputs.

__Model__: A [`Model`](molpal/model/base.py) is trained on labeled data to produce a posterior distribution that guides the next round of acquisition.

__Objective__: An [`Objective`](molpal/objectives/base.py) handles calculation of the objective function for unlabeled inputs.
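
The division of labor among these classes can be illustrated with a toy pool-based loop. The sketch below is not MolPAL's implementation or API: every name in it is hypothetical, the "model" is a nearest-labeled-neighbor lookup over integers, and acquisition is greedy.

```python
import random


def objective(x):
    """Stand-in for an expensive calculation (the Objective's role)."""
    return -(x - 70) ** 2


def predict(x, labeled):
    """Toy surrogate (the Model's role): score of the nearest labeled input."""
    nearest = min(labeled, key=lambda seen: abs(seen - x))
    return labeled[nearest]


def explore(pool, init_size, batch_size, iterations, seed=0):
    """Toy optimization loop (the Explorer's role)."""
    rng = random.Random(seed)
    # Initial batch: random acquisition from the pool.
    labeled = {x: objective(x) for x in rng.sample(pool, init_size)}
    for _ in range(iterations):
        unlabeled = [x for x in pool if x not in labeled]
        if not unlabeled:
            break
        # Acquirer's role with a greedy metric: rank by predicted score.
        unlabeled.sort(key=lambda x: predict(x, labeled), reverse=True)
        for x in unlabeled[:batch_size]:
            labeled[x] = objective(x)
    return labeled


if __name__ == "__main__":
    scores = explore(pool=list(range(100)), init_size=5, batch_size=5, iterations=6)
    best = max(scores, key=scores.get)
    print(f"evaluated {len(scores)} of 100 inputs; best input found: {best}")
```

Each pass through the loop retrains nothing explicitly because the toy surrogate simply reads the labeled set; a real model would be refit on the labeled data before each round of inference.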

### Preprocessing

For models expecting vectors as inputs (e.g., random forest and feed-forward neural network models), molecular fingerprints must be calculated first. Because the set of fingerprints used for inference is the same each time, it makes sense to cache them, and that is exactly what the base `MoleculePool` (also referred to as an `EagerMoleculePool`) does. However, the complete set of fingerprints for most libraries would be too large to hold in memory on most systems, so they are instead stored on disk in an HDF5 file that is transparently prepared for the user during MolPAL startup (if not already provided with the `--fps` option).

If you wish to prepare this file ahead of time, you can use [`scripts/fingerprints.py`](scripts/fingerprints.py). While this process can be parallelized over an arbitrarily large ray cluster (see [above](#setting-up-a-ray-cluster)), in our tests we were I/O limited above 12 cores, at which point it takes about 4 hours to prepare an HDF5 file of 100M fingerprints. __Note__: if MolPAL prepares the file for you, it prints a message saying where the file was written (usually under the $TMP directory) and whether there were invalid SMILES. To reuse this fingerprints file, simply move it to a persistent directory after MolPAL has completed its run. Additionally, if there were __no__ invalid SMILES, you can pass the `--validated` flag to further speed up MolPAL startup.

To prepare the fingerprints file corresponding to the sample command below, issue the following command:

In [6]:
!python scripts/fingerprints.py --library libraries/Enamine50k.csv.gz --fingerprint pair --length 2048 --radius 2 --name libraries/fps_enamine50k

2021-05-30 14:17:39,144	INFO services.py:1269 -- View the Ray dashboard at http://127.0.0.1:8265
Namespace(delimiter=',', fingerprint='pair', length=2048, library='libraries/Enamine50k.csv.gz', name='libraries/fps_enamine50k', no_title_line=False, path='.', radius=2, smiles_col=0, title_line=True, total_size=None)
Precalculating fps:   0% 0/1 [00:00<?, ?batch/s]
Calculating fingerprints:   0% 0/197 [00:00<?, ?chunk/s]
Calculating fingerprints:   1% 1/197 [00:00<00:29,  6.72chunk/s]
Calculating fingerprints:   4% 8/197 [00:00<00:20,  9.19chunk/s]
Calculating fingerprints:   6% 11/197 [00:00<00:16, 11.57chunk/s]
Calculating fingerprints:   8% 15/197 [00:00<00:12, 14.57chunk/s]
Calculating fingerprints:  10% 19/197 [00:00<00:10, 17.56chunk/s]
Calculating fingerprints:  12% 23/197 [00:00<00:08, 20.85chunk/s]
Calculating fingerprints:  14% 27/197 [00:00<00:07, 23.89chunk/s]
Calculating fingerprints:  16% 31/197 [00:00<00:06, 26.82chunk/s]

The resulting fingerprint file will be located in your current working directory as `libraries/fps_enamine50k.h5`. To use this in the sample command below, add `--fps libraries/fps_enamine50k.h5` to the argument list.

### Running MolPAL

#### Examples
The general command to run MolPAL is as follows:

`python molpal.py -o <objective_type> [additional objective arguments] --library <path/to/library.csv[.gz]> [additional library arguments] [additional model/encoding/acquisition/stopping/logging arguments]`

Alternatively, you may use a configuration file to run MolPAL, like so:

`python molpal.py --config <path/to/config_file>`

Two sample configuration files are provided: [minimal_config.ini](config/minimal_config.ini), which specifies only the arguments necessary to run MolPAL, and [sample_config.ini](config/sample_config.ini), which contains a few common options (but not _all_ possible options).

Configuration files accept the following syntaxes:
- `--arg value` (argparse)
- `arg: value` (YAML)
- `arg = value` (INI)
- `arg value`
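
To make the equivalence of these syntaxes concrete, the toy parser below maps each form to the same key/value pair. It is only an illustration: `parse_config_line` is a hypothetical helper, not the parser MolPAL uses, and it ignores comments, section headers, and flags that take no value.

```python
import re

# Optional leading "--", a key, then ":", "=", or whitespace, then the value.
_LINE = re.compile(r"^(?:--)?([A-Za-z][\w-]*)\s*[:=\s]\s*(.+)$")


def parse_config_line(line):
    """Return (key, value) for one config line in any of the four syntaxes."""
    match = _LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized config line: {line!r}")
    key, value = match.groups()
    return key, value.strip()


if __name__ == "__main__":
    for line in ["--batch-size 0.01", "batch-size: 0.01",
                 "batch-size = 0.01", "batch-size 0.01"]:
        print(parse_config_line(line))
```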

A sample command to run one of the experiments used to generate data in the initial publication is as follows:

`python run.py --config config_expts/Enamine50k_retrain.ini --name molpal_50k --metric greedy --init-size 0.01 --batch-size 0.01 --model rf`

or the full command:

In [11]:
!python run.py --name molpal_50k --write-intermediate --write-final --retrain-from-scratch --library libraries/Enamine50k.csv.gz --validated --metric greedy --init-size 0.01 --batch-size 0.01 --model rf --fingerprint pair --length 2048 --radius 2 --objective lookup --lookup-path data/Enamine50k_scores.csv.gz --lookup-smiles-col 0 --lookup-data-col 1 --minimize --top-k 0.01 --window-size 10 --delta 0.01 --max-epochs 5

*********************************************************************
*  __    __     ______     __         ______   ______     __        *
* /\ "-./  \   /\  __ \   /\ \       /\  == \ /\  __ \   /\ \       *
* \ \ \-./\ \  \ \ \/\ \  \ \ \____  \ \  _-/ \ \  __ \  \ \ \____  *
*  \ \_\ \ \_\  \ \_____\  \ \_____\  \ \_\    \ \_\ \_\  \ \_____\ *
*   \/_/  \/_/   \/_____/   \/_____/   \/_/     \/_/\/_/   \/_____/ *
*********************************************************************
Welcome to MolPAL!
MolPAL will be run with the following arguments:
  batch_size: 0.01
  cache: False
  cluster: False
  config: None
  ddp: False
  delimiter: ,
  delta: 0.01
  distributed: False
  docked_ligand_file: None
  epsilon: 0.0
  fingerprint: pair
  fps: None
  init_size: 0.01
  k: 0.01
  length: 2048
  library: libraries/Enamine50k.csv.gz
  lookup_data_col: 1
  lookup_path: data/Enamine50k_scores.csv.gz
  lookup_sep: ,
  lookup_smiles_col: 0
  lookup_title_line: True
  m: 1.0
  max_depth: 8
  

#### Required Settings
The primary purpose of MolPAL is to accelerate virtual screens in a prospective manner. Currently (December 2020), MolPAL supports computational docking screens using the [`pyscreener`](https://github.com/coleygroup/pyscreener) library.

`-o` or `--objective`: The objective function you would like to use. Choices include `docking` for docking objectives and `lookup` for lookup objectives. There are additional arguments for each type of objective.
- `docking`: given the variety of screening options allowed by the `pyscreener` library, it's likely easiest to specify an `--objective-config` rather than providing these options on the command line. The `objective-config` file must be provided in the format of a `pyscreener` configuration file, so some options might have different names (e.g., `size` in that file rather than `--box-size`). Any options specified on the command line will override any options provided in the configuration file. 
  * `--software`: the docking software you would like to use. Choices: 'vina', 'smina', 'psovina', 'qvina', and 'ucsfdock' (Default = 'vina').
  * `--receptor`: the filepath of the receptor you are attempting to dock ligands into.
  * `--box-center`: the x-, y-, and z-coordinates (Å) of the center of the docking box.
  * `--box-size`: the x-, y-, and z-radii of the docking box in Å.
  * `--docked-ligand-file`: the name of a file containing the coordinates of a docked/bound ligand. If using Vina-type software, this file must be a PDB format file. Either `--box-center` and `--box-size` must be specified or a docked ligand file must be provided. In the case that both are provided, 
  * `--score-mode`: the method by which to calculate an overall score from multiple scored conformations.
- `lookup`
  * `--lookup-path`: the filepath of a CSV file containing score information for each input.

`--library`: the filepath of a CSV file containing the virtual library as SMILES strings.
- (optional) `--fps`: the filepath of an HDF5 file containing the precomputed fingerprints of your virtual library. MolPAL relies on the assumption that the ordering of the fingerprints in this file is exactly the same as that of the library file and that the encoder used to generate these fingerprints is exactly the same as the one used for model training. MolPAL handles writing this file for you if unspecified, so this option is mostly useful for avoiding the startup overhead when running MolPAL again with the same library/encoder settings.

#### Optional Settings
MolPAL has a number of different model architectures, encodings, acquisition metrics, and stopping criteria to choose from. Many of these choices have default settings that were arrived at through hyperparameter optimization, but your circumstances may call for modifying these choices. To see the full list, run MolPAL with either the `-h` or `--help` flags. A few common options to specify are shown below.

`-k`: the fraction (if between 0 and 1) or number (if greater than 1) of top scores to evaluate when calculating an average. (Default = 0.005)

`--window-size` and `--delta`: the principal stopping criterion of MolPAL is whether or not the current top-k average score is better than the moving average of the `window_size` most recent top-k average scores by at least `delta`. (Default: `window_size` = 3, `delta` = 0.1)

`--max-explore`: if you would like to limit MolPAL to exploring a fixed fraction of the library or number of inputs, you can specify that by setting this value. (Default = 1.0)

`--max-epochs`: Alternatively, you may specify the maximum number of epochs of exploration. (Default = 50)
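
The top-k average and the `--window-size`/`--delta` stopping rule described above can be sketched as follows. This is one minimal reading of the description, not MolPAL's code: the function names are hypothetical, scores are assumed to be maximized, the current average is compared against the preceding `window_size` averages, and the search is assumed to continue until a full window is available.

```python
def top_k_average(scores, k):
    """Average of the top-k scores; k < 1 is read as a fraction of the scores."""
    if not scores:
        raise ValueError("scores must be non-empty")
    # Assumption: fractional k is truncated, with at least one score kept.
    n = max(1, int(k * len(scores))) if k < 1 else int(k)
    top = sorted(scores, reverse=True)[:n]
    return sum(top) / len(top)


def should_stop(history, current, window_size=3, delta=0.1):
    """True if `current` fails to beat the recent moving average by `delta`."""
    if len(history) < window_size:
        return False  # assumption: keep going until the window is full
    window = history[-window_size:]
    moving_average = sum(window) / window_size
    return current - moving_average < delta


if __name__ == "__main__":
    print(top_k_average([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], k=0.2))
    print(should_stop([8.0, 8.5, 8.6], current=8.65))
    print(should_stop([8.5, 8.6, 8.65], current=8.66))
```

In the second call the top-k average is still improving on the window by more than `delta`, so the search continues; in the third the improvement has flattened out and the rule fires.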

`--model`: the type of model to use. Choices include `rf`, `gp`, `nn`, and `mpn`. (Default = `rf`)  
  - `--conf-method`: the confidence estimation method to use for the NN or MPN models. Choices include `ensemble`, `dropout`, `mve`, and `none`. (Default = 'none'). NOTE: the MPN model does not support ensembling.

`--metric`: the acquisition metric to use. Choices include `random`, `greedy`, `ucb`, `pi`, `ei`, `thompson`, and `threshold`. (Default = `greedy`) Some metrics include additional settings (e.g., the β value for `ucb`).
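
For reference, two of these metrics can be written in a few lines. The sketch below follows the textbook definitions of greedy and upper-confidence-bound acquisition for a maximization problem; the function names, signatures, and the default `beta` are hypothetical and may not match `molpal/acquirer/metrics.py`.

```python
import math


def greedy(means):
    """Greedy utility: the predicted mean itself."""
    return list(means)


def ucb(means, variances, beta=2.0):
    """Upper confidence bound: mean plus beta standard deviations."""
    return [m + beta * math.sqrt(v) for m, v in zip(means, variances)]


def acquire(xs, utilities, batch_size):
    """Select the batch_size inputs with the highest utility."""
    ranked = sorted(zip(utilities, xs), reverse=True)
    return [x for _, x in ranked[:batch_size]]


if __name__ == "__main__":
    xs = ["A", "B", "C"]
    means = [1.0, 0.9, 0.5]
    variances = [0.0, 0.04, 1.0]
    print(acquire(xs, greedy(means), batch_size=1))
    print(acquire(xs, ucb(means, variances), batch_size=1))
```

Greedy acquisition ignores the predicted variance entirely, whereas UCB favors inputs whose predictions are uncertain; in the toy data, that difference changes which input is selected first.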

#### GPU usage
MolPAL will automatically use a GPU if it detects one. If this is undesired, use the following command before running: `export CUDA_VISIBLE_DEVICES=''`
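
In a notebook, a shell `export` issued with `!` runs in a subshell and does not persist, so one option is to set the variable from Python before any GPU-aware library is imported. This relies on the general CUDA convention for `CUDA_VISIBLE_DEVICES` and is not specific to MolPAL.

```python
import os

# Hide all CUDA devices from this process and from any child processes
# (e.g., commands launched later with `!python ...`).
os.environ["CUDA_VISIBLE_DEVICES"] = ""
```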

### Hyperparameter Optimization
While the default settings of MolPAL were chosen based on hyperparameter optimization with Optuna, they were calculated in the context of structure-based discovery and our computational resources. It is possible that these settings are not optimal for your particular problem. To adapt MolPAL to new circumstances, we recommend first generating a dataset that is representative of your particular problem and then performing hyperparameter optimization of your own using the `LookupObjective` class. This class acts as an oracle for your particular objective function, enabling both consistent and near-instant calculation of the objective function for a particular input, which saves time during hyperparameter optimization.
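
The idea behind a lookup oracle is simple enough to sketch with the standard library. The code below is an illustration rather than the `LookupObjective` class itself: `load_lookup_table` and its parameters are hypothetical names that loosely mirror the `--lookup-smiles-col`, `--lookup-data-col`, and `--minimize` options, and negating scores under `minimize` is an assumption about how a minimization target would be turned into a maximization one.

```python
import csv
import io


def load_lookup_table(lines, smiles_col=0, data_col=1, title_line=True,
                      minimize=False):
    """Read (SMILES, score) rows into a dict, skipping unparsable scores."""
    reader = csv.reader(lines)
    if title_line:
        next(reader, None)  # discard the header row
    table = {}
    for row in reader:
        try:
            score = float(row[data_col])
        except (ValueError, IndexError):
            continue  # e.g., molecules whose score is missing
        table[row[smiles_col]] = -score if minimize else score
    return table


if __name__ == "__main__":
    text = "smiles,score\nCCO,-5.1\nc1ccccc1,-6.3\nCCN,\n"
    scores = load_lookup_table(io.StringIO(text), minimize=True)
    print(scores)
```

For a gzipped file, the same function can be given a handle opened with `gzip.open(path, "rt")`. Because every query is a dictionary lookup, repeated optimization runs over the same dataset cost almost nothing.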

### Future Directions
Though MolPAL was originally intended for use with protein-ligand docking screens, it was designed with modularity in mind and is readily extensible to other settings as well. In principle, all that is required to adapt MolPAL to a new problem is to write a custom `Objective` subclass that implements the `calc` method. This method takes a sequence of SMILES strings as input and returns a mapping from SMILES string -> objective function value to be utilized by the Explorer. _To this end, we are currently exploring the extension of MolPAL to subsequent stages of virtual discovery (MD, DFT, etc.)_ If you make use of the MolPAL library by implementing a new `Objective` subclass, we would be happy to include your work in the main branch.
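
A custom objective might look roughly like the sketch below. The `Objective` base class here is a stand-in defined only so the example runs on its own; the real base class in `molpal/objectives/base.py` may require a different constructor or method signature, so check it before subclassing. The scoring function is deliberately trivial.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Sequence


class Objective(ABC):
    """Stand-in for MolPAL's base class (assumed interface, not the real one)."""

    @abstractmethod
    def calc(self, smis: Sequence[str]) -> Dict[str, Optional[float]]:
        """Map each SMILES string to its objective function value."""


class SmilesLengthObjective(Objective):
    """Toy objective: the length of the SMILES string."""

    def calc(self, smis: Sequence[str]) -> Dict[str, Optional[float]]:
        return {smi: float(len(smi)) for smi in smis}


if __name__ == "__main__":
    print(SmilesLengthObjective().calc(["CCO", "c1ccccc1"]))
```

A real subclass would replace the body of `calc` with the expensive calculation of interest while still returning one entry per input SMILES string.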

### Reproducing Experimental Results
#### Generating data
The data used in the original publication was generated by running the submission script with the corresponding configuration file located in `config_experiments` and the library name (e.g., '10k', '50k', 'HTS', or 'AmpC') as the two command line arguments. The submission script was designed to be used with a SLURM scheduler, but if you want to rerun the experiments on your machine, you can simply follow the submission script logic to generate the proper command line arguments or write a new configuration file. The AmpC data was too large to include in this repo, but it may be downloaded from [here](https://figshare.com/articles/AmpC_screen_table_csv_gz/7359626).